ABSTRACT

The application of nearest neighbor rules and other local rules to the
problem of estimating a parameter $\theta$ is investigated. It is assumed that
a loss function $L$, an observed random vector $X$, and data consisting of
a sequence of independent random vectors $(X_1,\theta_1),\ldots,(X_n,\theta_n)$ with the
same distribution as $(X,\theta)$ are given. Conditions are shown for which,
if $R^*$ denotes the Bayes risk (the minimum expected loss possible), then
the conditional expected loss of the k-nearest neighbor rule, conditioned
on the data, converges to $(1 + 1/k)R^*$ for squared-error loss functions.
For $k_n$-nearest neighbor rules where $k_n \to \infty$ and $k_n/n \to 0$, conditions are
given under which the rules are asymptotically optimal.

In addition, methods of estimating the conditional risk of a rule
with a particular data set are investigated. For a class of rules called
local rules, the performance of two different estimates of the risk is
bounded independently of the underlying distribution of $(X,\theta)$. This
enables the statistician to construct confidence intervals for the risk
of the rule and data he is using, without knowledge of the distribution
of $(X,\theta)$.
I. INTRODUCTION

1.1 A Description of the Nonparametric Estimation Problem

The estimation problem to be considered can be loosely described
as the problem of determining how to guess the value of an unknown
parameter $\theta$, when the only available information concerning the value of $\theta$
is contained in i) an observation $X$ which is related to $\theta$ in some
probabilistic sense, and ii) some form of information concerning the probability
structure underlying the relationship between $X$ and $\theta$. A simple example
of this type of problem is the question of determining how to estimate
the weight of an individual selected at random from some population, when
the individual's height and certain statistics concerning the heights and
weights of members of the population are known. A more interesting and
realistic example could be the problem of determining how to estimate the
production to be expected from an oil well when the available information
is in the form of measurements such as pressure and temperature within
the well, and knowledge of past experience concerning the relationship
between such measurements and production for preceding wells.

In order to complete the description of the estimation problem,
the relationship between the observation $X$ and the parameter $\theta$ must be
specified. In addition, some method of comparing the performance of
various estimators must be determined. In the analysis to follow, it
will be assumed that $(X,\theta)$ is a random vector with joint distribution
function $F(x,\theta)$. (This formulation is known as the Bayesian estimation
problem.) We will assume that $X$ takes values in $\mathbb{R}^d$ and $\theta$
takes values in $\mathbb{R}^p$. We will also assume the knowledge of a loss
function $L$ defined on $\mathbb{R}^p \times \mathbb{R}^p$ so that $L(\theta,\hat\theta)$ is the loss incurred
when $\theta$ is the true value of the parameter and $\hat\theta$ is the estimate. If we
define an estimation rule as a function $\hat\theta: \mathbb{R}^d \to \mathbb{R}^p$ and if we intend to
use $\hat\theta$ to estimate the parameters for a number of observations, then it
is natural to consider $E\{L(\theta,\hat\theta(X))\}$ as a performance criterion for $\hat\theta$.
($E\{L(\theta,\hat\theta(X))\}$ is known as the risk associated with the estimation rule
$\hat\theta$.) The statistician will be interested in choosing a function $\hat\theta$ to
minimize $E\{L(\theta,\hat\theta(X))\}$ since the average loss incurred when $\hat\theta$ is used
on a large number of observations will be near this value.

From this discussion it is clear that the amount and type of
information available to the statistician concerning the distribution
$F(x,\theta)$ must be a crucial factor in the determination of the estimator $\hat\theta$.
We will briefly discuss two possible degrees of knowledge of $F(x,\theta)$

$$R^* = E[E\{L(\theta,\theta^*) \mid X\}] = E\{L(\theta,\theta^*)\} \tag{1.2}$$

is known as the Bayes risk and is clearly less than or equal to the risk,
$E\{L(\theta,\hat\theta)\}$, for any other estimation rule. $R^*$ is the smallest possible
risk and any estimation rule which has risk $R^*$ is called an optimal rule.

In general, in the absence of fairly specific information concerning
$F(x,\theta)$, an optimal rule cannot be constructed.

There are any number of problems in which partial knowledge of
$F(x,\theta)$ is available. The one to be discussed throughout the remainder
of this report is known as the nonparametric estimation problem, in which
the only information concerning $F(x,\theta)$ is contained in a data sequence
$(X_1,\theta_1),\ldots,(X_n,\theta_n)$ of independent, identically distributed random
vectors, with distribution $F(x,\theta)$. A nonparametric estimation rule $\hat\theta$
will be defined as a mapping

$$\hat\theta: \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R}^p)^n \to \mathbb{R}^p. \tag{1.3}$$

(We do not exclude the possibility that $\hat\theta$ may incorporate randomization,
as will frequently be the case.) An example of such an estimation rule
is the nearest neighbor rule. This rule was first discussed in the context
of estimation by Cover [1], but it was originally presented as a
discrimination procedure in a pair of reports by Fix and Hodges [2,3]. The
nearest neighbor rule is quite obvious in concept. If $X$ is an observation
for which $\theta$ is to be estimated, the rule chooses as its estimate the value
$\theta_j$ associated with the observation $X_j$ from the data set which is closest to $X$.
A simple generalization of the nearest neighbor rule is the k-nearest
neighbor rule. This rule uses as its estimate of $\theta$ the average of the
parameters of the k closest observations to $X$ from the data set. Another
example of a nonparametric estimation rule is one in which the data set
is used to estimate $F(x,\theta)$. If $\hat F(x,\theta)$ is the estimate of $F(x,\theta)$, then
$\hat\theta(x)$ is chosen to minimize $E\{L(\theta,\hat\theta(x)) \mid X = x\}$ where the expectation is
taken with respect to the distribution $\hat F$. The rule $\hat\theta$ thus constructed
would actually be optimal if $\hat F(x,\theta)$ were equal to $F(x,\theta)$.


1.2 Evaluation of Estimation Rules

The process of selecting and evaluating an estimation rule should
consist basically of examining the following three factors:

i) cost of implementation and use

ii) asymptotic performance

iii) finite sample performance.

The question of implementation cost will not be considered in any detail
here. Both of the remaining factors have strong significance for the
statistician.

Asymptotic performance is often called "large sample" performance,
but the question of how large $n$ (the size of the data set) should be so that
a rule approaches its large sample performance is a difficult one. The
adequacy of a data set depends heavily on the distribution $F(x,\theta)$, and
it requires but a modicum of effort to imagine problems where fifty
samples is either a small or a large data set. In spite of this, if the
statistician has any hope at all of obtaining a data set which is adequate for
his problem, he will be interested in knowing how the large sample
performance of his rule compares with $R^*$, the performance of an optimal rule.
In cases where there is reason to believe that an adequate data set is
available, the statistician's knowledge of asymptotic behavior alone may
enable him to choose one rule over another. Hence the importance of
knowledge concerning the asymptotic behavior of the risk of a rule.

The importance of knowledge concerning the finite sample
performance of a rule with the particular data set which is currently available
is even more readily apparent. For example, suppose that a statistician
is using a rule for which he knows the risk converges in some sense to
$R^*$. Without knowledge of $F(x,\theta)$, however, the statistician does not
know the value of $R^*$, so that even if his data set is quite large, he does
not know how well his rule will perform. In fact, even though the rule
should do as well as possible in the large sample case, its performance
with the data set available may be unacceptably bad if either $R^*$ is large
or the data set is not adequate. As another example of the need for a
good estimate of finite sample performance, consider the case where
additional data may be acquired at some significant cost. In this case
the statistician's knowledge of current performance may indicate that
there is no need to gather additional data, or it may enable him to
measure the performance improvement achieved by expanding the data set,
so that he can determine if the improvement is worth the cost. In order
to be of benefit in solving these problems, the estimate of finite sample
performance should be a function only of the rule and the data set to be
used.

1.3  Discussion  of  Results 

Most of the results presented are for the nearest neighbor rule
and variations. These rules remain among the most interesting solutions
to the estimation problem because of their simplicity and because of the
strength of the results which may be obtained for them. The major
criticisms of the nearest neighbor rules are that the entire data set must
be stored, and that, for large data sets in a many-dimensional observation
space, the computation required to find the nearest neighbors can be
significant. But the fact that most of the information contained in the
data about an observation $X$ is inherently contained in the observations
near $X$ implies that any rule which would take full advantage of the data
set must be generally subject to the same criticisms. The nearest
neighbor rules continue to serve as a benchmark against which other estimation
rules should be compared.

In order to discuss the results, some notation to be used in the
following pages will be described. The conditional n-sample risk for
an estimation rule $\hat\theta: \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R}^p)^n \to \mathbb{R}^p$, conditioned on the data
$(X_1,\theta_1),\ldots,(X_n,\theta_n)$, will be denoted

$$L_n = E\{L(\theta,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\}. \tag{1.4}$$

Note that the dependence of $\hat\theta = \hat\theta(X,(X_1,\theta_1),\ldots,(X_n,\theta_n))$ on $X$ and the
data is not explicit in this notation. $L_n$ is clearly a random variable
since it is readily seen to be a measurable function of the data.

In Cover's [1] original paper on estimation with nearest neighbor
rules it was shown that

$$R^* \le \limsup_{n\to\infty} E L_n \le 2R^* \tag{1.5}$$

for bounded metric loss functions with certain continuity assumptions on
$E(L(\theta,\theta_0) \mid X,\theta_0)$. A metric loss function is a function $L: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ which
is a metric on $\mathbb{R}^p$. (Cover actually claims that $E L_n \to R$, but his proof
fails to demonstrate this.) If $\|\cdot\|$ denotes a norm on $\mathbb{R}^p$ then the
squared-error loss function is defined as

$$L(\theta_1,\theta_2) = \|\theta_1 - \theta_2\|^2. \tag{1.6}$$

For the case of the squared-error loss function with the k-nearest neighbor
rule Cover gives conditions for which

$$E L_n \to R \tag{1.7}$$

where

[the expression defining $R$ is illegible in the source] (1.8)

This result provides bounds on $R$, which can be interpreted loosely as
the average large sample risk, where the averaging is performed over
all large data sets. This result, although of interest to the statistician,
does not really address itself to the questions concerning asymptotic
performance which were pointed out as being most interesting in section
1.2. The reason is that the statistician does not have a large number of
large data sets over which he will average the performance of his rule.
The question concerning asymptotic performance which is really of
interest is what happens when a single data set is made large. This
question is answered in Chapter II, in which it is shown that for the
k-nearest neighbor rule with the squared-error loss function

$$L_n \to R \text{ in probability} \tag{1.9}$$

where $R$ satisfies (1.8). This result is analogous to results obtained by
Wagner [4] for the discrimination problem. In addition, for metric loss
functions conditions are given under which, for all $\epsilon > 0$,

$$P\{L_n - 2R^* \ge \epsilon\} \to 0. \tag{1.10}$$

It is also shown that if $k_n/n \to 0$ and $k_n \to \infty$, then

$$L_n \to R^* \text{ in probability} \tag{1.11}$$

for the $k_n$-nearest neighbor rule with the squared-error loss function.
A rule which satisfies (1.11) is called asymptotically optimal.
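
The factor $1 + 1/k$ quoted in the abstract can be motivated by a short heuristic calculation. The sketch below is not the argument of Chapter II: it assumes $p = 1$ and, as a simplifying assumption, that for large $n$ the parameters $\theta^1,\ldots,\theta^k$ of the k nearest neighbors behave like independent draws from the conditional distribution of $\theta$ given $X = x$. The symbols $m(x)$ and $\sigma^2(x)$ are introduced here only for illustration.

```latex
% Heuristic only: p = 1, and the k nearest parameters are treated as
% independent draws from the conditional law of \theta given X = x,
% independent of \theta itself.
\begin{align*}
m(x) &= E(\theta \mid X = x), \qquad
\sigma^2(x) = E\{(\theta - m(x))^2 \mid X = x\},\\
E\Big\{\Big(\theta - \tfrac{1}{k}\sum_{i=1}^{k}\theta^{i}\Big)^{2} \,\Big|\, X = x\Big\}
  &\approx \sigma^2(x) + \frac{\sigma^2(x)}{k},\\
E\Big\{\Big(1 + \tfrac{1}{k}\Big)\sigma^2(X)\Big\}
  &= \Big(1 + \tfrac{1}{k}\Big) R^* .
\end{align*}
```

The last line uses the fact that, under squared-error loss, the Bayes risk is the expected conditional variance of $\theta$. Under the same heuristic, letting $k = k_n \to \infty$ drives the factor to one, which is consistent with (1.11).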

Chapter III is primarily concerned with the evaluation of finite
sample performance for estimation rules. (Reference is made to Toussaint's
[5] survey of techniques employed for this purpose in discrimination
problems. Most of the same methods can be applied to estimation
problems.) This problem can be restated in terms of using the given
data set to estimate $L_n$. Two estimates of $L_n$ are shown to have the
property that for certain classes of estimation rules

$$P\{|\hat L_n - L_n| \ge \epsilon\}$$

can be bounded independently of the distribution $F(x,\theta)$. The bound
obtained for the deleted estimate decreases at rate $1/n$, while the
bound for the holdout estimate decreases at rate $1/\sqrt{n}$. These bounds
provide a solution to the need for methods of evaluating finite sample
performance by allowing the statistician to construct confidence
intervals for $L_n$ which are independent of the underlying distribution $F(x,\theta)$.
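
The two estimates of $L_n$ named above can be sketched in a few lines. This is a minimal illustration, not the construction of Chapter III: it assumes scalar $X$ and $\theta$, squared-error loss, that the deleted estimate is the usual leave-one-out average, and that the holdout estimate builds the rule on the first part of the data and scores it on the rest. The function names are hypothetical.

```python
def knn_predict(x, data, k):
    """Average theta over the k pairs in data whose X is closest to x."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(theta for _, theta in nearest) / k

def deleted_estimate(data, k):
    """Leave-one-out: predict each theta_i from the other n - 1 pairs."""
    total = 0.0
    for i, (x_i, theta_i) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total += (theta_i - knn_predict(x_i, rest, k)) ** 2
    return total / len(data)

def holdout_estimate(data, k, n_test):
    """Build the rule on the first n - n_test pairs, score it on the last n_test."""
    train, test = data[:-n_test], data[-n_test:]
    return sum((t - knn_predict(x, train, k)) ** 2 for x, t in test) / n_test

# Toy data set of (X_i, theta_i) pairs.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
print(deleted_estimate(data, k=1))
print(holdout_estimate(data, k=1, n_test=2))
```

Both functions depend only on the rule and the data set, which is the requirement stated in section 1.2.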

In Chapters II and III, the main emphasis is on providing
solutions to the problems discussed in 1.2 concerning the evaluation of
estimation rules. Chapter IV consists of a discussion of the results of a
simulation study performed on the deleted estimate and the holdout estimate
of $L_n$. This study was performed in order to gain some experimental
verification of the theoretical results of Chapter III.


II. ASYMPTOTIC PERFORMANCE OF NEAREST
NEIGHBOR RULES IN ESTIMATION

II.1 Preliminary Remarks

This chapter will present results concerning the convergence of
the conditional n-sample risk for k-nearest neighbor rules and $k_n$-nearest
neighbor rules. In order to obtain these results it has been necessary
to restrict the class of loss functions to squared-error loss functions
and metric loss functions. A squared-error loss function satisfies

$$L(\theta_1,\theta_2) = \|\theta_1 - \theta_2\|^2 \tag{2.1}$$

where $\|\cdot\|$ is a norm on the space $\mathbb{R}^p$. A metric loss function is one in
which $L(\theta_1,\theta_2)$ is a metric on the space $\mathbb{R}^p$. These two types of loss
functions cover a broad range of practical applications.

The types of nearest neighbor rules to be discussed will be
chiefly those which weight the k nearest neighbors equally in forming
the estimate. Such rules can be described as follows. Let
$D_n = (X_1,\theta_1),\ldots,(X_n,\theta_n)$ be the data sequence consisting of $n$ independent
identically distributed random vectors with distribution $F(x,\theta)$, where for each
$i$, $X_i$ takes values in $\mathbb{R}^d$ and $\theta_i$ takes values in the parameter space $\mathbb{R}^p$.
Then, if $(X,\theta)$ has distribution $F(x,\theta)$, the k-nearest neighbor rule estimate
of $\theta$ is

$$\hat\theta = \frac{1}{k}\sum_{i=1}^{k}\theta^{i} \tag{2.2}$$

where $\theta^{i}$ is the parameter associated with the ith closest observation
to $X$ from $D_n$, distance being measured in the usual Euclidean metric
on $\mathbb{R}^d$, or any other metric on $\mathbb{R}^d$. The $k_n$-nearest neighbor rule is
obtained by simply allowing $k$ to be a function of $n$ in (2.2).
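
As a concrete reading of the equally weighted rule just described, the following minimal sketch averages the parameters of the k observations closest to $x$. The function name `knn_estimate`, the list-of-pairs data layout, and the tie-breaking by data order are illustrative assumptions; the metric defaults to the Euclidean distance but any metric on $\mathbb{R}^d$ can be passed in.

```python
import math

def knn_estimate(x, data, k, metric=math.dist):
    """Average the parameters of the k observations closest to x.

    data is a list of (X_i, theta_i) pairs of float sequences.
    Distance ties are broken by position in the data (sorted is stable).
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must satisfy 1 <= k <= n")
    nearest = sorted(data, key=lambda pair: metric(x, pair[0]))[:k]
    p = len(nearest[0][1])
    # Componentwise average of the k selected parameter vectors.
    return [sum(theta[c] for _, theta in nearest) / k for c in range(p)]

# Three observations in R^2 with scalar parameters (p = 1).
data = [((0.0, 0.0), (1.0,)), ((1.0, 0.0), (3.0,)), ((5.0, 5.0), (10.0,))]
print(knn_estimate((0.2, 0.1), data, k=1))  # single nearest neighbor
print(knn_estimate((0.2, 0.1), data, k=2))  # average of the two closest
```

A $k_n$-nearest neighbor rule is obtained by calling the same function with `k` chosen as a function of `len(data)`, for example a rounded square root, which is one choice satisfying $k_n \to \infty$ and $k_n/n \to 0$.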

Section II.2 will contain theorems concerning the convergence of
the conditional risk for the single ($k = 1$) nearest neighbor rule. In section
II.3 these results will be generalized to $k > 1$, and in section II.4 results
will be shown for $k_n$-nearest neighbor rules. Section II.5 will contain a
brief discussion of nearest neighbor rules which do not weight the k
nearest neighbors equally in forming the estimate of $\theta$.


II.2 The Single-Nearest Neighbor Rule

The following lemma will be used extensively in this chapter.
We first define the support of a probability density function $f$ as the
smallest closed set $E$ such that

$$\int_E f(x)\,dx = 1. \tag{2.3}$$

Lemma 1: Let $X_1, X_2, \ldots, X_n$ be a sequence of independent identically
distributed random vectors taking values in $\mathbb{R}^d$, and let $f$ be a probability
density function on $\mathbb{R}^d$ corresponding to the distribution of $X_1$.
Let $\rho(\cdot,\cdot)$ be a metric on $\mathbb{R}^d$, let $E \subset \mathbb{R}^d$ be the support of $f$, and let
$K \subset E$ be compact with the metric $\rho$. Let

$$V_{j,n} = \{x \in \mathbb{R}^d : \rho(X_j,x) < \rho(X_i,x),\ 1 \le i \le n,\ i \ne j\}. \tag{2.4}$$

Then

$$r_n = \max_{1 \le j \le n}\Big\{\sup_{x,y \in KV_{j,n}} \rho(x,y)\Big\} \to 0 \quad \text{w.p.1} \tag{2.5}$$

and

$$\lambda_n = \max_{1 \le j \le n} \mu(KV_{j,n}) \to 0 \quad \text{w.p.1} \tag{2.6}$$

where $\mu$ denotes Lebesgue measure on $\mathbb{R}^d$ and $KV_{j,n}$ denotes the
intersection $K \cap V_{j,n}$.


Proof: Note that if $r_n \to 0$, then $\lambda_n \to 0$. Since $r_n$ is monotonically
decreasing in $n$, it suffices to show that for all $\epsilon > 0$

$$P\{r_n \ge \epsilon\} \to 0. \tag{2.7}$$

Let $S_{\epsilon/4}(y)$ denote the open sphere in $(\mathbb{R}^d,\rho)$ with center $y$ and radius
$\epsilon/4$. Since $K$ is compact, there exists a finite collection of points
$\{y_1,\ldots,y_m\} \subset K$ such that

$$K \subset \bigcup_{i=1}^{m} S_{\epsilon/4}(y_i). \tag{2.8}$$

Now, suppose that for each $i$, the sphere $S_{\epsilon/4}(y_i)$ contains at least one
of the random vectors $X_j$. Let $x \in K$ so that $x \in S_{\epsilon/4}(y_i)$ for some $i$,
$1 \le i \le m$. Since $S_{\epsilon/4}(y_i)$ contains one of the $X_j$, the distance from $x$
to its nearest neighbor from $X_1,\ldots,X_n$ must be less than $\epsilon/2$. Since
this is clearly true for all $x \in K$ we have $KV_{j,n} \subset S_{\epsilon/2}(X_j)$, $1 \le j \le n$,
implying that $r_n < \epsilon$. We can conclude that the event $\{r_n \ge \epsilon\}$ occurs
only if one or more of the spheres $S_{\epsilon/4}(y_i)$ contains none of the random
vectors $X_1,\ldots,X_n$. Then, if $\{X_j \notin S_{\epsilon/4}(y_i)\}$ denotes the event that $X_j$
is not an element of $S_{\epsilon/4}(y_i)$,

$$P\{r_n \ge \epsilon\} \le P\Big\{\bigcup_{i=1}^{m}\bigcap_{j=1}^{n}\{X_j \notin S_{\epsilon/4}(y_i)\}\Big\} \le \sum_{i=1}^{m}\Big[1 - \int_{S_{\epsilon/4}(y_i)} f(x)\,dx\Big]^{n}. \tag{2.9}$$

Since each $y_i$ lies in the support $E$, $\int_{S_{\epsilon/4}(y_i)} f(x)\,dx > 0$, so that

$$P\{r_n \ge \epsilon\} \to 0$$

which proves the lemma.
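
The final step of the proof rests on the bound (2.9): a finite sum of terms $(1 - p_i)^n$ with every $p_i > 0$ tends to zero as $n$ grows. The sketch below evaluates that sum for made-up ball probabilities; the function name and the values of `probs` are hypothetical and are not taken from the report.

```python
def empty_ball_bound(ball_probs, n):
    """Sum over covering balls of (1 - p_i)**n, the right-hand side of (2.9).

    ball_probs[i] is the probability that a single observation falls in the
    i-th covering ball; each must be positive for the bound to vanish.
    """
    return sum((1.0 - p) ** n for p in ball_probs)

probs = [0.5, 0.25, 0.25]  # hypothetical probabilities of m = 3 covering balls
for n in (1, 10, 50):
    print(n, empty_ball_bound(probs, n))
```

For small $n$ the sum can exceed one, so the bound is only informative once $n$ is large relative to the smallest ball probability.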

In the theorems to follow, it will be necessary to have the sets
$V_{j,n}$, $1 \le j \le n$, form a partition of $\mathbb{R}^d$. As defined in (2.4), they are
disjoint, but their union does not cover $\mathbb{R}^d$. Let $x \in \mathbb{R}^d$ be such that
$x \notin \bigcup_{j=1}^{n} V_{j,n}$. Then there must exist $i \le n$ and $j \le n$, $i \ne j$, such that

$$\rho(X_i,x) = \rho(X_j,x).$$

Clearly $x$ is a boundary point of the sets $V_{i,n}$ and $V_{j,n}$. In order to
modify the sets $V_{j,n}$, $1 \le j \le n$, so that they form a partition of $\mathbb{R}^d$,
we arbitrarily assign each $x \notin \bigcup_{j=1}^{n} V_{j,n}$ to one of the sets for which it
is a boundary point. The proof of Lemma 1 is unchanged when the $V_{j,n}$
are modified in this way, and throughout the rest of the chapter we
will assume that the sets $V_{j,n}$ are a partition of $\mathbb{R}^d$.
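
One simple way to realize the arbitrary assignment of boundary points is to give every tie to the smallest index. The sketch below returns the index of the cell containing $x$ under that convention; the function name `cell_index` and the smallest-index rule are illustrative choices, since the text allows any assignment to a set for which $x$ is a boundary point.

```python
import math

def cell_index(x, points, metric=math.dist):
    """Index j of the nearest-neighbor cell containing x.

    Points equidistant from several observations are assigned to the
    smallest such index, so every x belongs to exactly one cell.
    """
    distances = [metric(x, p) for p in points]
    return distances.index(min(distances))  # index() returns the first minimum

points = [(0.0,), (2.0,)]
print(cell_index((0.5,), points))  # interior of the first cell
print(cell_index((1.0,), points))  # boundary point, assigned by the tie rule
print(cell_index((1.5,), points))  # interior of the second cell
```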
Lemma 2: Let $X_1,\ldots,X_n$ be a sequence of independent identically
distributed random vectors taking values in $\mathbb{R}^d$, and let $F$ be the
distribution of $X_1$. Let $f$ be a density corresponding to $F$, and let
$V_{j,n}$, $1 \le j \le n$, be as defined previously. Then

$$\max_{1 \le j \le n} P\{V_{j,n} \mid X_1,\ldots,X_n\} \to 0 \quad \text{w.p.1}.$$

Proof: Let $E$ be the support of $f$, and let $\epsilon > 0$ be given. Then, there
exists a set $K \subset E$, compact with the metric $\rho$, such that

$$P\{K^c\} < \epsilon.$$

Then

$$\max_{1 \le j \le n} P\{V_{j,n} \mid X_1,\ldots,X_n\} \le \max_{1 \le j \le n} P\{KV_{j,n} \mid X_1,\ldots,X_n\} + P\{K^c\} \le \max_{1 \le j \le n} P\{KV_{j,n} \mid X_1,\ldots,X_n\} + \epsilon.$$

Hence it suffices to show that

$$\max_{1 \le j \le n} P\{KV_{j,n} \mid X_1,\ldots,X_n\} \to 0 \quad \text{w.p.1}.$$

Let $S_n$ denote the set $KV_{j,n}$ for which $P\{KV_{j,n} \mid X_1,\ldots,X_n\}$ is maximized
over $1 \le j \le n$. Now

$$\mu S_n \le \max_{1 \le j \le n} \mu KV_{j,n}$$

so that, by Lemma 1,

$$\mu S_n \to 0 \quad \text{w.p.1}.$$

Hence,

$$\max_{1 \le j \le n} P\{KV_{j,n} \mid X_1,\ldots,X_n\} \to 0 \quad \text{w.p.1}$$

since $F$ is absolutely continuous with respect to Lebesgue measure on
$\mathbb{R}^d$. This concludes the proof of Lemma 2.
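
Lemma 2 can be made concrete in one dimension. The sketch below assumes, purely for illustration, that $X$ is uniform on $[0,1]$, so that each nearest-neighbor cell is an interval bounded by midpoints between adjacent observations and its probability is its length. Evenly spaced points are used as a deterministic stand-in for a random sample; the function name is hypothetical.

```python
def cell_probabilities(points):
    """Cell probabilities for distinct points in [0, 1] when X is uniform on [0, 1].

    Each cell is the interval between the midpoints to the neighboring
    observations, so its probability equals its length.
    """
    xs = sorted(points)
    cuts = [0.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [1.0]
    return [hi - lo for lo, hi in zip(cuts, cuts[1:])]

for n in (10, 100, 1000):
    grid = [(i + 0.5) / n for i in range(n)]   # stand-in for n observations
    print(n, max(cell_probabilities(grid)))    # largest cell probability
```

With a genuinely random sample the largest cell is somewhat bigger than $1/n$, but it still shrinks as observations accumulate, which is what the lemma asserts with probability one.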


Lemma 3: Let $X_1,\ldots,X_n$ and $F$ be as in Lemma 2. Let $X'_n(x)$ be the
nearest neighbor to $x \in \mathbb{R}^d$ from $X_1,\ldots,X_n$. Then

$$P\{x \in \mathbb{R}^d : X'_n(x) \to x\} = 1.$$

Proof: Let $E$ be the support of $F$, and let $\delta > 0$ be given. Then, there
exists $K \subset E$ such that $K$ is compact with the metric $\rho$ and

$$P\{K\} > 1 - \delta.$$

By Lemma 1,

$$\sup_{x \in K}\{\rho(X'_n(x),x)\} \to 0 \quad \text{w.p.1}.$$

Hence $P\{x \in \mathbb{R}^d : X'_n(x) \to x\} > 1 - \delta$,
which proves Lemma 3.

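Before proving Theorem 1, it will be necessary to briefly discuss
some well-known facts concerning optimal estimation rules for
squared-error loss functions. Let $\hat\theta: \mathbb{R}^d \to \mathbb{R}^p$ be an estimation rule, and let
$\theta^*(X)$ be any version of $E(\theta \mid X)$. Then $\theta^*$ also defines an estimation
rule. With the squared-error loss criterion, the risk associated with
$\hat\theta$ is given by

$$E\{\|\theta - \hat\theta(X)\|^2\} = E\{\|\theta - \theta^*(X) + \theta^*(X) - \hat\theta(X)\|^2\}$$
$$= E\{\|\theta - \theta^*(X)\|^2 + 2\langle\theta - \theta^*(X),\ \theta^*(X) - \hat\theta(X)\rangle + \|\hat\theta(X) - \theta^*(X)\|^2\},$$

where $\langle\cdot,\cdot\rangle$ denotes the usual inner product on $\mathbb{R}^p$. Now,

$$2E\{\langle\theta - \theta^*(X),\ \theta^*(X) - \hat\theta(X)\rangle\} = 2E\{E\{\langle\theta - \theta^*(X),\ \theta^*(X) - \hat\theta(X)\rangle \mid X\}\}.$$

Examining the conditional expectation inside the brackets, we have

$$E\{\langle\theta - \theta^*(X),\ \theta^*(X) - \hat\theta(X)\rangle \mid X\} = E\{\langle\theta - \theta^*(X),\ \theta^*(X)\rangle \mid X\} - E\{\langle\theta - \theta^*(X),\ \hat\theta(X)\rangle \mid X\}.$$

Letting $\theta_i$, $\theta^*_i(X)$, and $\hat\theta_i(X)$ denote the ith components of $\theta$, $\theta^*(X)$, and
$\hat\theta(X)$ respectively, we see that

$$E\{(\theta_i - \theta^*_i(X))\,\theta^*_i(X) \mid X\} = \theta^*_i(X)\,[E(\theta_i \mid X) - \theta^*_i(X)] = 0 \quad \text{w.p.1},$$

since $\theta^*_i(X)$ is a version of $E(\theta_i \mid X)$. We can conclude that

$$E\{\langle\theta - \theta^*(X),\ \theta^*(X)\rangle \mid X\} = 0$$

and, similarly,

$$E\{\langle\theta - \theta^*(X),\ \hat\theta(X)\rangle \mid X\} = 0.$$

This implies that

$$E\{\langle\theta - \theta^*(X),\ \theta^*(X) - \hat\theta(X)\rangle\} = 0.$$

Then,

$$E\{\|\theta - \hat\theta(X)\|^2\} = E\{\|\theta - \theta^*(X)\|^2\} + E\{\|\hat\theta(X) - \theta^*(X)\|^2\},$$

implying

$$E\{\|\theta - \theta^*(X)\|^2\} \le E\{\|\theta - \hat\theta(X)\|^2\}$$

with equality if and only if

$$E\{\|\hat\theta(X) - \theta^*(X)\|^2\} = 0.$$

The conclusion is that $\theta^*(X)$ is an optimal estimation rule, also known
as a Bayes rule, if, and only if, it is a version of $E(\theta \mid X)$.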

The risk associated with an optimal rule, known as the Bayes
risk and denoted $R^*$, is given by

$$R^* = E\{L(\theta,\theta^*(X))\}$$
$$= E\{\|\theta - E(\theta \mid X)\|^2\}$$
$$= E[E\{\|\theta - E(\theta \mid X)\|^2 \mid X\}]$$
$$= E[E(\|\theta\|^2 \mid X) - 2E\{\langle\theta, E(\theta \mid X)\rangle \mid X\} + \|E(\theta \mid X)\|^2].$$

Using the fact that

$$E\{\theta_i E(\theta_i \mid X) \mid X\} = E^2(\theta_i \mid X)$$

we have

$$E\{\langle\theta, E(\theta \mid X)\rangle \mid X\} = \|E(\theta \mid X)\|^2.$$

Then,

$$R^* = E[E(\|\theta\|^2 \mid X) - \|E(\theta \mid X)\|^2]. \tag{2.12}$$
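
The decomposition of the risk and the formula (2.12) can be checked exactly on a small discrete example. The joint distribution below is made up for illustration and is not taken from the report; `Fraction` arithmetic is used so that the identities hold without rounding error, and the competing rule `other_rule` is an arbitrary choice.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution of (X, theta), p = 1: {(x, theta): probability}.
joint = {(0, 0): Fr(1, 4), (0, 2): Fr(1, 4), (1, 1): Fr(1, 8), (1, 5): Fr(3, 8)}
marginal = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

def cond_moment(x, power):
    """E(theta**power | X = x) under the joint table."""
    return sum(p * t ** power for (xi, t), p in joint.items() if xi == x) / marginal[x]

def risk(rule):
    """Squared-error risk E (theta - rule(X))**2."""
    return sum(p * (t - rule(x)) ** 2 for (x, t), p in joint.items())

def bayes_rule(x):
    return cond_moment(x, 1)      # the conditional mean E(theta | X = x)

def other_rule(x):
    return Fr(2) * x              # an arbitrary competing rule

bayes_risk = risk(bayes_rule)
# (2.12): R* = E[ E(theta^2 | X) - E^2(theta | X) ]
via_2_12 = sum(marginal[x] * (cond_moment(x, 2) - cond_moment(x, 1) ** 2) for x in (0, 1))
# Excess risk of the competing rule: E (other_rule(X) - bayes_rule(X))^2
gap = sum(marginal[x] * (other_rule(x) - bayes_rule(x)) ** 2 for x in (0, 1))

print(bayes_risk, via_2_12, risk(other_rule), bayes_risk + gap)
```

The last two printed values agree, which is the statement that the cross term vanishes and the excess risk of any rule is its mean squared distance from the conditional mean.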

Theorem 1: Let $(X,\theta),(X_1,\theta_1),\ldots,(X_n,\theta_n)$ be a sequence of independent
identically distributed random vectors, each with distribution $F(x,\theta)$,
with $X_i$ taking values in $\mathbb{R}^d$ and $\theta_i$ taking values in a compact subset of
$\mathbb{R}^p$, $1 \le i \le n$. Let $L(\theta_1,\theta_2)$ be the squared-error loss function. If the
distribution of $X$ has a density and versions of $E(\theta \mid X)$ and $E(\|\theta\|^2 \mid X)$
exist which are continuous on $\mathbb{R}^d$ with probability one, then

$$L_n \to 2R^* \text{ in probability}$$

for the single-nearest neighbor rule.

Proof: In order to simplify the notational difficulties, the proof will be
carried out for $p = 1$. The modifications necessary to carry out the
proof for $p > 1$ will be discussed in the remarks at the end of this
chapter. Let $\epsilon > 0$ be given. Then,

$$P\{|L_n - 2R^*| \ge \epsilon\} \le P\{|L_n - E(L_n \mid X_1,\ldots,X_n)| \ge \epsilon/2\} + P\{|E(L_n \mid X_1,\ldots,X_n) - 2R^*| \ge \epsilon/2\}. \tag{2.13}$$

We will show that each term on the right-hand side of (2.13) tends to
zero as $n$ becomes large. The first term can be bounded by Chebychev's
inequality so that it will be sufficient to show that

$$E[L_n - E(L_n \mid X_1,\ldots,X_n)]^2 \to 0$$

or, equivalently,

$$E\{E[(L_n - E(L_n \mid X_1,\ldots,X_n))^2 \mid X_1,\ldots,X_n]\} \to 0. \tag{2.14}$$

For the squared-error loss function, the hypothesis that $\theta$ takes values
in a compact set guarantees that there is a constant $M < \infty$ such that

$$E[(L_n - E(L_n \mid X_1,\ldots,X_n))^2 \mid X_1,\ldots,X_n] \le M \tag{2.15}$$

with probability one for all $n$. Then, by the Lebesgue dominated
convergence theorem, (2.14) will be established if it is shown that

$$E[(L_n - E(L_n \mid X_1,\ldots,X_n))^2 \mid X_1,\ldots,X_n] \to 0 \text{ in probability.} \tag{2.16}$$

Let $D_n = \{(X_1,\theta_1),\ldots,(X_n,\theta_n)\}$. Then

$$E[(L_n - E(L_n \mid X_1,\ldots,X_n))^2 \mid X_1,\ldots,X_n] = E[\{E(L(\theta,\hat\theta) \mid D_n) - E(L(\theta,\hat\theta) \mid X_1,\ldots,X_n)\}^2 \mid X_1,\ldots,X_n]. \tag{2.17}$$

The right-hand side of (2.17) can be written as

$$E\Big[\Big\{\sum_{j=1}^{n}\big[E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid D_n) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n)\big]\Big\}^2 \,\Big|\, X_1,\ldots,X_n\Big] \tag{2.18}$$

where $V_{j,n}$ was defined previously, and

$$I_{[V_{j,n}]}(x) = \begin{cases} 1, & \text{if } x \in V_{j,n} \\ 0, & \text{otherwise.} \end{cases} \tag{2.19}$$

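Now, the summation in (2.18) will be squared, giving:

$$E\Big[\sum_{j=1}^{n}\big\{E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid D_n) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n)\big\}^2 \,\Big|\, X_1,\ldots,X_n\Big]$$
$$+\ E\Big[\sum_{i \ne j}\big\{E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid D_n) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n)\big\}$$
$$\times \big\{E(L(\theta,\theta_i)I_{[V_{i,n}]}(X) \mid D_n) - E(L(\theta,\theta_i)I_{[V_{i,n}]}(X) \mid X_1,\ldots,X_n)\big\} \,\Big|\, X_1,\ldots,X_n\Big]. \tag{2.20}$$

The second term of (2.20), which is the expectation of the sum of the
cross-product terms, can be written as follows by taking advantage of
the fact that $\theta_j$, conditioned on $X_j$, is independent of $\theta_i$ for all $i \ne j$:

$$\sum_{i \ne j} E\Big[\big\{E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n,\theta_j) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n)\big\}$$
$$\times \big\{E(L(\theta,\theta_i)I_{[V_{i,n}]}(X) \mid X_1,\ldots,X_n,\theta_i) - E(L(\theta,\theta_i)I_{[V_{i,n}]}(X) \mid X_1,\ldots,X_n)\big\} \,\Big|\, X_1,\ldots,X_n\Big]. \tag{2.21}$$

Let $g_j(X_1,\ldots,X_n,\theta_j)$ be any version of

$$E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n,\theta_j) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n).$$

Then (2.21) is equal to

$$\sum_{i \ne j} E[g_j(X_1,\ldots,X_n,\theta_j)\,g_i(X_1,\ldots,X_n,\theta_i) \mid X_1,\ldots,X_n]. \tag{2.22}$$

Since $\theta_i$ and $\theta_j$, conditioned on $X_i$ and $X_j$, are independent, (2.22) is
equal to

$$\sum_{i \ne j} E(g_j(X_1,\ldots,X_n,\theta_j) \mid X_1,\ldots,X_n)\,E(g_i(X_1,\ldots,X_n,\theta_i) \mid X_1,\ldots,X_n) = 0. \tag{2.23}$$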

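The first term of (2.20) is rewritten as follows, again employing the
conditional independence of $\theta_1,\ldots,\theta_n$:

$$\sum_{j=1}^{n} E\Big[\big\{E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n,\theta_j) - E(L(\theta,\theta_j)I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n)\big\}^2 \,\Big|\, X_1,\ldots,X_n\Big] \tag{2.24}$$

$$= \sum_{j=1}^{n} E\Big[\big\{E\big[\{E(L(\theta,\theta_j) \mid X,X_j,\theta_j) - E(L(\theta,\theta_j) \mid X,X_j)\}\,I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n,\theta_j\big]\big\}^2 \,\Big|\, X_1,\ldots,X_n\Big]. \tag{2.25}$$

Taking advantage of the conditional independence of $\theta$ and $\theta_j$ given
$X$ and $X_j$,

$$E[L(\theta,\theta_j) \mid X,X_j,\theta_j] = E(\theta^2 \mid X) - 2\theta_j E(\theta \mid X) + \theta_j^2 \tag{2.26}$$

and

$$E[L(\theta,\theta_j) \mid X,X_j] = E(\theta^2 \mid X) - 2E(\theta_j \mid X_j)E(\theta \mid X) + E(\theta_j^2 \mid X_j). \tag{2.27}$$

Since the parameters take values in a compact set, (2.26) and (2.27)
are bounded, so that (2.24) is bounded with probability one by

$$\sum_{j=1}^{n} M\,E^2(I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n) \tag{2.28}$$

for some $M < \infty$. Since $\sum_{j=1}^{n} E(I_{[V_{j,n}]}(X) \mid X_1,\ldots,X_n) = 1$, (2.28) is bounded by

$$M \max_{1 \le j \le n} P\{V_{j,n} \mid X_1,\ldots,X_n\}. \tag{2.29}$$

By Lemma 2, (2.29) converges to zero with probability one, establishing
(2.14).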

In order to complete the proof, it remains to show that the second
term of (2.13) tends to zero, or

$$E(L_n \mid X_1,\ldots,X_n) \to 2R^* \text{ in probability.} \tag{2.30}$$

In order to prove (2.30) we first note that

$$E(L_n \mid X_1,\ldots,X_n) = E[E(L(\theta,\hat\theta) \mid X,X_1,\ldots,X_n) \mid X_1,\ldots,X_n]. \tag{2.31}$$

The right-hand side of (2.31) is equal to

$$E\Big\{\sum_{j=1}^{n} E\big[E[(\theta - \theta_j)^2 I_{[V_{j,n}]}(X) \mid X,D_n] \mid X,X_1,\ldots,X_n\big] \,\Big|\, X_1,\ldots,X_n\Big\} \tag{2.32}$$

$$= E\Big[\sum_{j=1}^{n}\{E(\theta^2 \mid X) - 2E(\theta \mid X)E(\theta_j \mid X_j) + E(\theta_j^2 \mid X_j)\}\,I_{[V_{j,n}]}(X) \,\Big|\, X_1,\ldots,X_n\Big]. \tag{2.33}$$

By Lemma 3, there exists a set $C \subset \mathbb{R}^d$ such that $P\{C\} = 1$, and for all
$x \in C$, $X'_n(x) \to x$ with probability one. Let $x \in C$. Then

$$\sum_{j=1}^{n}\{E(\theta^2 \mid X = x) - 2E(\theta \mid X = x)E(\theta_j \mid X_j) + E(\theta_j^2 \mid X_j)\}\,I_{[V_{j,n}]}(x) \to 2[E(\theta^2 \mid X = x) - E^2(\theta \mid X = x)] \quad \text{w.p.1} \tag{2.34}$$

where we have used the continuity with probability one of $E(\theta \mid X)$ and
$E(\theta^2 \mid X)$. Since $\theta$ takes values in a compact set, and since the
convergence in (2.34) holds for a set in $\mathbb{R}^d$ which has probability one,
the Lebesgue dominated convergence theorem, together with (2.12), implies

$$E(L_n \mid X_1,\ldots,X_n) \to 2R^* \quad \text{w.p.1}.$$

This concludes the proof of Theorem 1.
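
Theorem 1 can be illustrated by simulation. The sketch below uses a hypothetical model chosen to satisfy the hypotheses, not one taken from the report: $X$ is uniform on $[0,1]$ and $\theta = \sin(2\pi X) + U$ with $U$ uniform on $[-1,1]$, so that $\theta$ is bounded, the conditional moments are continuous, and $R^* = 1/3$. A single data set is drawn, and $L_n$ for the single-nearest neighbor rule is approximated by averaging the squared error over fresh $(X,\theta)$ pairs.

```python
import bisect
import math
import random

def one_nn_conditional_risk(n, n_test, half_width, rng):
    """Approximate L_n for the single-nearest neighbor rule on one simulated data set.

    Hypothetical model: X uniform on [0, 1], theta = sin(2*pi*X) + U with
    U uniform on [-half_width, half_width], so R* = half_width**2 / 3.
    """
    def draw():
        x = rng.random()
        return x, math.sin(2 * math.pi * x) + rng.uniform(-half_width, half_width)

    data = sorted(draw() for _ in range(n))   # one fixed data set, sorted by X
    xs = [x for x, _ in data]
    total = 0.0
    for _ in range(n_test):                   # fresh (X, theta) pairs
        x, theta = draw()
        i = bisect.bisect_left(xs, x)
        # In one dimension the nearest neighbor is one of the two points bracketing x.
        j = min((c for c in (i - 1, i) if 0 <= c < n), key=lambda c: abs(xs[c] - x))
        total += (theta - data[j][1]) ** 2
    return total / n_test

half_width = 1.0
bayes_risk = half_width ** 2 / 3
estimate = one_nn_conditional_risk(n=2000, n_test=20000, half_width=half_width,
                                   rng=random.Random(0))
print(estimate, 2 * bayes_risk)               # the two values should be close
```

The printed estimate is itself a Monte Carlo approximation of the random variable $L_n$, so it fluctuates slightly around $2R^*$ from one data set to another; the theorem says these fluctuations vanish as $n$ grows.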

The convergence of $L_n$ for the case where $L(\theta,\hat\theta)$ is a metric loss
function is difficult to prove. However, in the following theorem it is
shown that $L_n$ is dominated by a random variable which converges in
probability to $2R^*$.


Theorem 2: Let $(X,\theta),(X_1,\theta_1),\ldots,(X_n,\theta_n)$ be a sequence of independent
identically distributed random vectors, each with distribution $F(x,\theta)$,
with $X$ taking values in $\mathbb{R}^d$ and $\theta$ taking values in $\mathbb{R}^p$. Let $L(\theta,\hat\theta)$ be a
bounded metric on $\mathbb{R}^p$. Then, for all $\epsilon > 0$

$$P\{L_n - 2R^* \ge \epsilon\} \to 0$$

if $E[L(\theta^*(x),\theta_j) \mid X = x, X_j]$ is continuous with probability one for each $x \in \mathbb{R}^d$
and if a marginal density for $X$ exists.

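Proof: Since $L(\theta,\hat\theta)$ is a metric on $\mathbb{R}^p$,

$$L(\theta,\hat\theta) \le L(\theta,\theta^*) + L(\theta^*,\hat\theta) \tag{2.35}$$

where $\theta^* = \theta^*(X)$ is an optimal (Bayes) estimate of $\theta$. Then

$$L_n = E\{L(\theta,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\} \tag{2.36}$$
$$\le E\{L(\theta,\theta^*) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\} + E\{L(\theta^*,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\}$$
$$= R^* + E\{L(\theta^*,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\}. \tag{2.37}$$

The theorem will be proved if we show that

$$E\{L(\theta^*,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\} \to R^* \text{ in probability.} \tag{2.38}$$

Let $L'_n = E\{L(\theta^*,\hat\theta) \mid (X_1,\theta_1),\ldots,(X_n,\theta_n)\}$. The proof of (2.38) will
be performed in two steps, by showing that, for each $\epsilon > 0$,

$$P\{|L'_n - E(L'_n \mid X_1,\ldots,X_n)| \ge \epsilon\} \to 0 \tag{2.39}$$

and

$$P\{|E(L'_n \mid X_1,\ldots,X_n) - R^*| \ge \epsilon\} \to 0. \tag{2.40}$$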

The proof of (2.39) will be shown first. By Chebyshev's inequality it suffices to show that

$$E[L'_n - E(L'_n/X_1,\ldots,X_n)]^2 \to 0 \qquad (2.41)$$

or, equivalently,

$$E\big\{E\big([L'_n - E(L'_n/X_1,\ldots,X_n)]^2/X_1,\ldots,X_n\big)\big\} \to 0. \qquad (2.42)$$

Since $L$ is bounded, the Lebesgue dominated convergence theorem can be used to show that (2.42) is true if

$$E\{[L'_n - E(L'_n/X_1,\ldots,X_n)]^2/X_1,\ldots,X_n\} \to 0 \quad \text{in probability}. \qquad (2.43)$$


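Let $V_{j,n}$ and $I_{[V_{j,n}]}$ be as defined previously. Then,

$$\begin{aligned}
&E\{[L'_n - E(L'_n/X_1,\ldots,X_n)]^2/X_1,\ldots,X_n\}\\
&\quad = E\Big\{\Big[\sum_{j=1}^{n}\big\{E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/D_n) - E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/X_1,\ldots,X_n)\big\}\Big]^2 \Big/ X_1,\ldots,X_n\Big\} \qquad (2.44)
\end{aligned}$$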


where $D_n = \{(X_1,\theta_1),\ldots,(X_n,\theta_n)\}$. Squaring the summation in (2.44) yields

$$\begin{aligned}
&= E\Big[\sum_{j=1}^{n}\big\{E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/D_n) - E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/X_1,\ldots,X_n)\big\}^2 \Big/ X_1,\ldots,X_n\Big]\\
&\quad + E\Big[\sum_{i\ne j}\big\{E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/D_n) - E(L(\theta^*,\theta_j)I_{[V_{j,n}]}(X)/X_1,\ldots,X_n)\big\}\\
&\qquad\quad \times \big\{E(L(\theta^*,\theta_i)I_{[V_{i,n}]}(X)/D_n) - E(L(\theta^*,\theta_i)I_{[V_{i,n}]}(X)/X_1,\ldots,X_n)\big\} \Big/ X_1,\ldots,X_n\Big]. \qquad (2.45)
\end{aligned}$$

The second term of (2.45), the expectation of the sum of the cross-product terms, is easily seen to be identically zero by using the conditional independence of $\theta_i$ and $\theta_j$ to bring the expectation over $\theta_i$ and $\theta_j$ inside the summation as a product of expectations. The techniques are similar to those used in the proof of Theorem 1.

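The first term of (2.45) is equal to

$$E\Big[\sum_{j=1}^{n}\Big(E\big[\{E(L(\theta^*,\theta_j)/X,X_j,\theta_j) - E(L(\theta^*,\theta_j)/X,X_j)\}\, I_{[V_{j,n}]}(X)\,\big/\,D_n\big]\Big)^2 \Big/ X_1,\ldots,X_n\Big]. \qquad (2.46)$$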


By the hypothesis concerning the boundedness of $L$, there exists $M < \infty$ such that (2.46) is bounded by

$$M\sum_{j=1}^{n} E^2\big(I_{[V_{j,n}]}(X)/X_1,\ldots,X_n\big) \quad \text{w.p.1}. \qquad (2.47)$$

Since $\sum_{j=1}^{n} E\big(I_{[V_{j,n}]}(X)/X_1,\ldots,X_n\big) = 1$, we can bound (2.47) by

$$M\max_{1\le j\le n}\big\{E(I_{[V_{j,n}]}(X)/X_1,\ldots,X_n)\big\},$$

which converges to zero with probability one by Lemma 2. This establishes (2.39).

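In order to complete the proof, we must show

$$E(L'_n/X_1,\ldots,X_n) \to R^* \quad \text{in probability}. \qquad (2.48)$$

The proof of (2.48) is done by noting

$$\begin{aligned}
E(L'_n/X_1,\ldots,X_n) &= E\{E[L(\theta^*,\hat\theta)/D_n]/X_1,\ldots,X_n\}\\
&= E\Big\{E\Big[\sum_{j=1}^{n} E(L(\theta^*,\theta_j)/X,X_j,\theta_j)\, I_{[V_{j,n}]}(X)\Big/ X,X_1,\ldots,X_n\Big]\Big/ X_1,\ldots,X_n\Big\}\\
&= E\Big[\sum_{j=1}^{n} E\{L(\theta^*,\theta_j)/X,X_j\}\, I_{[V_{j,n}]}(X)\Big/ X_1,\ldots,X_n\Big]. \qquad (2.49)
\end{aligned}$$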

Now, by Lemma 3, there exists a set $C \subset \mathbb{R}^d$ such that $P\{C\} = 1$, and for all $x \in C$, $X'_n(x) \to x$ with probability one. Then, for all $x \in C$,

$$\sum_{j=1}^{n} E\{L(\theta^*(x),\theta_j)/X=x,X_j\}\, I_{[V_{j,n}]}(x) \to E\{L(\theta^*(x),\theta)/X=x\} \quad \text{w.p.1}, \qquad (2.50)$$

since $E\{L(\theta^*(x),\theta_j)/X=x,X_j\}$ is continuous with probability one by hypothesis. Then, since $L$ is bounded, (2.50) and the Lebesgue dominated convergence theorem imply (2.48). This concludes the proof of Theorem 2.


II.3 The k-Nearest Neighbor Rule

In this section, the results of section II.2 will be generalized to the k-nearest neighbor rules, where more than one neighbor of $X$ enters into the estimate of $\theta$. The theorems in this section will assume k is fixed, while in section II.4, k will be allowed to increase with n.

Intuitively, the advantage of using more than one nearest neighbor in the estimation process is simply the hope that the larger sample will smooth out some of the statistical variation and produce a better estimate. The pitfall associated with increasing k stems from the fact that as more samples are used in the estimate, the kth-nearest neighbor to $X$ gets farther away from $X$, on the average. Hence, the strength of the statistical relationship between $X$ and its kth-nearest neighbor declines as k increases. In general then, the above arguments would indicate that if n is fixed, increasing k should result in improved performance until k reaches the point where too many samples are being used which have little or no statistical relationship to $(X,\theta)$. The difficulty lies in determining where this value of k is reached, a nontrivial problem which is quite dependent on the form of $F(x,\theta)$. In fact, Cover and Hart [6] give an example of a distribution for the discrimination problem in which the single-nearest neighbor rule performs better than any k-nearest neighbor rule where k > 1. The same ideas can be used to produce a similar example for the estimation problem. Basically, a distribution $F(x,\theta)$ is chosen on $\mathbb{R}^2$ which is uniform on $\ell$ unit discs centered at $\{(10j, 10j),\ j = 1,\ldots,\ell\}$. In this case, with the squared-error loss function, the expected loss for a k-nearest neighbor rule is bounded by 1 if at least k data points lie on each disc. If one of the discs contains fewer than k data points, the expected loss increases dramatically. For a fixed n, as k increases, the likelihood of k surpassing the smallest number of samples on a disc increases, so that the risk increases. This discussion, while not precise, shows how a rigorous example can be constructed.

The purpose of the preceding paragraph was to point out that when n is fixed, it is always possible to find distributions for which the single nearest neighbor rule can be expected to perform better than any k > 1 nearest neighbor rule. Generally speaking, larger values of k require larger data sets to ensure adequate performance. However, the theorems in this section will show that when sufficiently large data sets are available, the use of larger values of k can cut the expected loss significantly. The added computational cost entailed by increasing k is associated with the cost of creating and maintaining a memory stack of the current k-nearest neighbors as the distance from $X$ to each observation in the data is computed. The additional cost does not appear to be prohibitive.
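The bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the text: it assumes scalar observations with the absolute difference as the default metric, and the names `knn_estimate`, `data`, and `dist` are hypothetical. A bounded max-heap stands in for the memory stack of current k-nearest neighbors, so a single pass over the data suffices.

```python
import heapq


def knn_estimate(data, x, k, dist=lambda a, b: abs(a - b)):
    """Average the parameters of the k observations nearest to x.

    data is an iterable of (x_i, theta_i) pairs and dist is a metric
    (absolute difference by default, an assumption of this sketch).
    """
    # Max-heap of size k, stored as (-distance, index, theta) so that the
    # root is always the farthest of the neighbors currently retained.
    heap = []
    for idx, (xi, theta) in enumerate(data):
        d = dist(x, xi)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx, theta))
        elif d < -heap[0][0]:
            # Strictly closer than the current k-th neighbor: replace it.
            heapq.heapreplace(heap, (-d, idx, theta))
    if len(heap) < k:
        raise ValueError("need at least k observations")
    return sum(theta for _, _, theta in heap) / k
```

With this layout each of the n observations costs at most one heap operation on a heap of size k, which is consistent with the remark that the extra cost of a larger k is modest.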

The proofs of Lemmas 4, 5 and 6 and Theorem 3 will make use of the proofs in section II.2. Lemma 4 is the k-nearest neighbor rule analog of Lemma 1. The following notation will be necessary. Let

$$\ell_n = \binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$

so that there are $\ell_n$ possible different subsets of $\{X_1,\ldots,X_n\}$, where each subset contains k observations. These subsets will be denoted $S_{1,n},\ldots,S_{\ell_n,n}$, and $T_{j,n}$ will be defined as the set of points which have $S_{j,n}$ as their set of k-nearest neighbors from the data. (Note that $\{T_{j,n}\}_{j=1}^{\ell_n}$ may be assumed to form a partition of $\mathbb{R}^d$, as was discussed previously.)

Lemma 4: Let $X_1,\ldots,X_n$ be a sequence of independent identically distributed random vectors taking values in $\mathbb{R}^d$, and let $f$ be a probability density function on $\mathbb{R}^d$ corresponding to the distribution of $X_1$. Let $\rho(\cdot,\cdot)$ be a metric on $\mathbb{R}^d$, let $E \subset \mathbb{R}^d$ be the support of $f$, and let $K \subset E$ be compact with the metric $\rho$. Then

$$r_n = \max_{1\le j\le \ell_n}\Big\{\sup_{x,y\in KT_{j,n}} \rho(x,y)\Big\} \to 0 \quad \text{w.p.1} \qquad (2.51)$$

and

$$\lambda_n = \max_{1\le j\le \ell_n}\{\mu(KT_{j,n})\} \to 0 \quad \text{w.p.1}. \qquad (2.52)$$

Proof: As in the proof of Lemma 1, it is necessary only to show that for each $\epsilon > 0$,

$$P\{r_n \ge \epsilon\} \to 0. \qquad (2.53)$$

By the compactness of $K$ there exists a finite set of spheres $S_{\epsilon/4k}(y_i)$, $1 \le i \le m$, such that

$$K \subset \bigcup_{i=1}^{m} S_{\epsilon/4k}(y_i). \qquad (2.54)$$

As in the proof of Lemma 1, it is easy to see that if each of the spheres contains at least one of the observations $X_1,\ldots,X_n$, then each $x \in K$ has its kth nearest neighbor no further away than $\epsilon/2$. Conclude that if each sphere contains at least one of the $X_1,\ldots,X_n$, then $T_{j,n}$, $1 \le j \le \ell_n$, can be contained in a sphere of radius less than $\epsilon/2$. The proof is concluded in the same manner as the proof of Lemma 1, by showing that the probability that any of the spheres contains none of the $X_i$, $1 \le i \le n$, tends to zero with n tending to infinity.

Lemma 5: Let $X_1,\ldots,X_n$ and $F$ be as in Lemma 2, and let $T_{j,n}$, $1 \le j \le \ell_n$, be as defined previously. Then

$$\max_{1\le j\le \ell_n} P\{T_{j,n}/X_1,\ldots,X_n\} \to 0 \quad \text{w.p.1}. \qquad (2.55)$$

Proof: The proof of this lemma is identical to the proof of Lemma 2, except that Lemma 4 must be used instead of Lemma 1.

Lemma 6: Let $X_1,\ldots,X_n$ and $F$ be as in Lemma 2, and let $X_{(k),n}(x)$ be the kth nearest neighbor to $x \in \mathbb{R}^d$ from $X_1,\ldots,X_n$. Then

$$P\{x \in \mathbb{R}^d : X_{(k),n}(x) \to x\} = 1. \qquad (2.56)$$

Proof: This proof is essentially the same as the proof of Lemma 3, except that Lemma 4 is used in place of Lemma 1.

The following theorem relates the asymptotic performance of the k-nearest neighbor rule to the performance of the optimal Bayesian estimator for squared-error loss functions.

Theorem 3: Let $(X,\theta), (X_1,\theta_1), \ldots, (X_n,\theta_n)$ be a sequence of independent identically distributed random vectors, each with distribution $F(x,\theta)$, with each $X_i$ taking values in $\mathbb{R}^d$, and each $\theta_i$ taking values in a compact subset of $\mathbb{R}^p$. Let $L(\theta_1,\theta_2)$ be the squared-error loss function. If the distribution of $X$ has a density and versions of $E(\theta/X)$ and $E(\|\theta\|^2/X)$ exist which are continuous on $\mathbb{R}^d$ with probability one, then

$$L_n \to \Big(1 + \frac{1}{k}\Big)R^* \quad \text{in probability}$$

for the k-nearest neighbor rule.
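The limit in Theorem 3 can be checked numerically on a toy model. The sketch below is only an illustration under assumptions that are not in the text: $X$ is uniform on (0, 1) and $\theta = X + U$ with $U$ uniform on (-1, 1) and independent of $X$, so that $E(\theta/X) = X$ is continuous, $\theta$ is bounded, and $R^* = \mathrm{Var}(U) = 1/3$. The function names are hypothetical. The code approximates $L_n$ for one simulated data set by averaging the squared error over fresh test pairs, and it assumes k does not exceed n.

```python
import bisect
import random


def knn_estimate_1d(xs, thetas, x, k):
    """k-NN average for scalar observations; xs must be sorted ascending."""
    hi = bisect.bisect_left(xs, x)  # first index with xs[hi] >= x
    lo = hi - 1
    total = 0.0
    for _ in range(k):
        # Take the closer of the nearest unused points on either side of x.
        if hi >= len(xs) or (lo >= 0 and x - xs[lo] <= xs[hi] - x):
            total += thetas[lo]
            lo -= 1
        else:
            total += thetas[hi]
            hi += 1
    return total / k


def simulate_risk(k, n=2000, n_test=4000, seed=0):
    """Monte Carlo approximation of L_n for the toy model theta = X + U."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, x + rng.uniform(-1.0, 1.0)))
    data.sort()
    xs = [x for x, _ in data]
    thetas = [t for _, t in data]
    loss = 0.0
    for _ in range(n_test):
        x = rng.random()
        theta = x + rng.uniform(-1.0, 1.0)
        loss += (theta - knn_estimate_1d(xs, thetas, x, k)) ** 2
    return loss / n_test
```

Under these assumptions, `simulate_risk(1)` should land near $2R^* = 2/3$ and `simulate_risk(4)` near $(5/4)R^* = 5/12$, up to Monte Carlo error.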


Proof: We again assume p = 1, the case of p > 1 being deferred to the remarks. First, note that

$$P\Big\{\Big|L_n - \frac{k+1}{k}R^*\Big| \ge \epsilon\Big\} \le P\{|L_n - E(L_n/X_1,\ldots,X_n)| \ge \epsilon/2\} + P\Big\{\Big|E(L_n/X_1,\ldots,X_n) - \frac{k+1}{k}R^*\Big| \ge \epsilon/2\Big\}. \qquad (2.57)$$

As in the proof of Theorem 1, each term of (2.57) will be shown to go to zero. The proof that

$$P\{|L_n - E(L_n/X_1,\ldots,X_n)| \ge \epsilon/2\} \to 0$$

proceeds exactly as before except that $\{T_{j,n}\}_{j=1}^{\ell_n}$ are used to partition the space into estimation regions instead of $\{V_{j,n}\}_{j=1}^{n}$, as in the proof of Theorem 1. Also, Lemma 5 must be used instead of Lemma 2, but no other serious difficulties are encountered.

The proof that

$$E(L_n/X_1,\ldots,X_n) \to \frac{k+1}{k}R^* \quad \text{in probability}$$

proceeds as follows. As defined previously, the set $S_{j,n} = \{X^1_{j,n},\ldots,X^k_{j,n}\}$ consists of the k observations from the data which are the k-nearest neighbors to each $x \in T_{j,n}$, $1 \le j \le \ell_n$. Let $\theta^1_{j,n},\ldots,\theta^k_{j,n}$ be the parameters associated with the observations in $S_{j,n}$ and let

$$\bar\theta_{j,n} = \frac{1}{k}\sum_{i=1}^{k}\theta^i_{j,n}. \qquad (2.58)$$

Note that (2.58) agrees with the definition given by (2.2) for the k-nearest neighbor rule estimate of $\theta$ for each $x \in T_{j,n}$.
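Then,

$$\begin{aligned}
E(L_n/X_1,\ldots,X_n) &= E[E(L_n/X,X_1,\ldots,X_n)/X_1,\ldots,X_n]\\
&= E\Big[\sum_{j=1}^{\ell_n} E\big\{E[(\theta-\bar\theta_{j,n})^2 I_{[T_{j,n}]}(X)/X,D_n]/X,X_1,\ldots,X_n\big\}\Big/ X_1,\ldots,X_n\Big]\\
&= E\Big[\sum_{j=1}^{\ell_n}\big\{E(\theta^2/X) - 2E(\theta/X)\,E(\bar\theta_{j,n}/S_{j,n}) + E(\bar\theta_{j,n}^2/S_{j,n})\big\}\, I_{[T_{j,n}]}(X)\Big/ X_1,\ldots,X_n\Big]. \qquad (2.59)
\end{aligned}$$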

Notice that, because of the conditional independence of the parameters given the observations,

$$E(\bar\theta_{j,n}/S_{j,n}) = \frac{1}{k}\sum_{i=1}^{k} E(\theta^i_{j,n}/X^i_{j,n}) \qquad (2.60)$$


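and

$$E(\bar\theta_{j,n}^2/S_{j,n}) = \frac{1}{k^2}\sum_{i=1}^{k} E\big((\theta^i_{j,n})^2/X^i_{j,n}\big) + \frac{1}{k^2}\sum_{i\ne m} E(\theta^i_{j,n}/X^i_{j,n})\,E(\theta^m_{j,n}/X^m_{j,n}). \qquad (2.61)$$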

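Now, if $X^i_{j,n} \to x$ for $1 \le i \le k$, then, since $E(\theta/X)$ and $E(\theta^2/X)$ are continuous with probability one, from (2.60) and (2.61) it can be seen that

$$\sum_{j=1}^{\ell_n} E(\bar\theta_{j,n}/S_{j,n})\, I_{[T_{j,n}]}(x) \to E(\theta/X=x) \quad \text{w.p.1} \qquad (2.62)$$

$$\sum_{j=1}^{\ell_n} E(\bar\theta_{j,n}^2/S_{j,n})\, I_{[T_{j,n}]}(x) \to \frac{1}{k}E(\theta^2/X=x) + \frac{k-1}{k}E^2(\theta/X=x) \quad \text{w.p.1}. \qquad (2.63)$$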

By Lemma 6, there exists a set $C \subset \mathbb{R}^d$ such that $P\{C\} = 1$ and, for all $x \in C$, the kth-nearest neighbor to $x$ converges to $x$ with probability one. Then, for each $x \in C$,

$$\sum_{j=1}^{\ell_n}\big\{E(\theta^2/X=x) - 2E(\theta/X=x)\,E(\bar\theta_{j,n}/S_{j,n}) + E(\bar\theta_{j,n}^2/S_{j,n})\big\}\, I_{[T_{j,n}]}(x) \to \Big(1 + \frac{1}{k}\Big)\big[E(\theta^2/X=x) - E^2(\theta/X=x)\big] \quad \text{w.p.1}. \qquad (2.64)$$

Since $\theta$ takes values in a compact set, and (2.64) holds on a set which has probability one, the Lebesgue dominated convergence theorem implies that (2.59) converges to $(1 + 1/k)R^*$ with probability one.

This completes the proof.
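The algebra behind the limit in (2.64) can be spelled out as a short worked step. This is a sketch that only combines the limits (2.62) and (2.63); the shorthand $m(x) = E(\theta/X=x)$ and $s(x) = E(\theta^2/X=x)$ is introduced here for brevity and is not notation from the text.

```latex
\begin{align*}
s(x) - 2\,m(x)\cdot m(x) + \Big[\tfrac{1}{k}\,s(x) + \tfrac{k-1}{k}\,m^2(x)\Big]
  &= \Big(1+\tfrac{1}{k}\Big)s(x) - \Big(2 - \tfrac{k-1}{k}\Big)m^2(x) \\
  &= \Big(1+\tfrac{1}{k}\Big)\big[s(x) - m^2(x)\big].
\end{align*}
```

Since $s(x) - m^2(x)$ is the conditional variance of $\theta$ given $X = x$, its expectation over $X$ is the Bayes risk $R^*$ for squared-error loss, which is where the factor $(1 + 1/k)R^*$ comes from.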

The  advantage  of  using  more  than  one  nearest  neighbor  in  the 
estimation  process  when  large  data  sets  are  available  is  made  apparent 
by  Theorem  3.  In  fact,  for  large  values  of  k the  risk  approaches  being 
half  that  of  the  single-nearest  neighbor  rule. 


II.4 The $k_n$-Nearest Neighbor Rule


In many applications, the data set available to the statistician is continually increasing in size. In cases like this, a fixed value of k will eventually become too small to take fullest advantage of the size of the data set. An obvious solution to this problem is to allow k to be an increasing function of n. In this section conditions will be given for which the performance of the $k_n$-nearest neighbor rule is asymptotically optimal for squared-error loss functions.

The following lemma is another generalization of Lemma 1. We now let $\ell_n = \binom{n}{k_n}$, so that $\ell_n$ is the number of different subsets that can be obtained from $\{X_1,\ldots,X_n\}$ where each subset contains $k_n$ elements. These subsets will be denoted $S_{1,n},\ldots,S_{\ell_n,n}$. We will define $T_{j,n}$ for $1 \le j \le \ell_n$ as the set of $x \in \mathbb{R}^d$ such that $S_{j,n}$ is the set which contains the $k_n$-nearest neighbors to $x$ from $X_1,\ldots,X_n$. Once again we note that the sets $T_{j,n}$, $1 \le j \le \ell_n$, can be modified to form a partition of $\mathbb{R}^d$, as discussed previously.

Lemma 7: Let $X_1,\ldots,X_n$ be independent identically distributed random vectors taking values in $\mathbb{R}^d$, and let $f$ be a probability density function on $\mathbb{R}^d$ corresponding to the distribution of $X_1$. Let $\rho(\cdot,\cdot)$ be a metric on $\mathbb{R}^d$, let $E$ be the support of $f$, and let $K \subset E$ be compact with the metric $\rho$. Then, if $k_n/n \to 0$,

$$r_n = \max_{1\le j\le \ell_n}\Big\{\sup_{x,y\in KT_{j,n}} \rho(x,y)\Big\} \to 0 \quad \text{w.p.1}$$

and

$$\lambda_n = \max_{1\le j\le \ell_n}\{\mu(KT_{j,n})\} \to 0 \quad \text{w.p.1}. \qquad (2.65)$$


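Proof: Let $\epsilon > 0$ be given. As in the proof of Lemma 1, since $r_n$ is monotonically decreasing, it suffices to show

$$P\{r_n \ge \epsilon\} \to 0.$$

Since $K$ is compact there exists a finite set of points $\{y_1,\ldots,y_m\}$ such that

$$K \subset \bigcup_{i=1}^{m} S_{\epsilon/4}(y_i).$$

Assume that each of the spheres $S_{\epsilon/4}(y_i)$, $1 \le i \le m$, contains at least $k_n$ of the observations $X_1,\ldots,X_n$. Then the distance from any $x \in K$ to its $k_n$th nearest neighbor is less than $\epsilon/2$, since $x$ lies in one of the spheres and every point of that sphere is within $\epsilon/2$ of $x$. This implies that each set $KT_{j,n}$ can be contained in a sphere with radius less than or equal to $\epsilon/2$. It remains to show that the probability that each sphere contains at least $k_n$ observations converges to one, or that the probability of the complementary event converges to zero.

$$P\{r_n \ge \epsilon\} \le P\Big\{\bigcup_{i=1}^{m}\big[S_{\epsilon/4}(y_i) \text{ contains fewer than } k_n \text{ observations}\big]\Big\}. \qquad (2.66)$$

Let $p_i = \int_{S_{\epsilon/4}(y_i)} f(x)\,dx$, $1 \le i \le m$, so that the number of observations falling in the ith sphere is binomial with parameters n and $p_i$. Feller [13, p.51] shows that if $k_n < np_i$, then

$$\sum_{\nu=0}^{k_n-1}\binom{n}{\nu}p_i^{\nu}(1-p_i)^{n-\nu} \le \frac{(n-k_n+1)\,p_i}{(np_i-k_n+1)^2} \qquad (2.67)$$

$$= \frac{\dfrac{n-k_n+1}{n}\cdot\dfrac{1}{n}\,p_i}{p_i^2 - \dfrac{2(k_n-1)}{n}\,p_i + \dfrac{(k_n-1)^2}{n^2}}. \qquad (2.68)$$

Since $k_n/n \to 0$, $k_n < np_i$ for n sufficiently large, and (2.68) converges to zero as $n \to \infty$. The convergence of $\lambda_n$ follows immediately, which proves the lemma.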
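The binomial tail bound used in this proof is easy to check numerically. The sketch below is illustrative only: it uses the general form of Feller's inequality, $P\{S_n \le r\} \le (n-r)p/(np-r)^2$ for $r < np$, with $r = k_n - 1$, and the function names and sample values of n, p, and $k_n$ are assumptions of this example.

```python
from math import comb


def binom_lower_tail(n, p, r):
    """P{S_n <= r} for S_n ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, v) * p**v * (1 - p) ** (n - v) for v in range(r + 1))


def feller_bound(n, p, r):
    """Upper bound (n - r) p / (n p - r)^2 on P{S_n <= r}, valid for r < n p."""
    if not r < n * p:
        raise ValueError("bound requires r < n*p")
    return (n - r) * p / (n * p - r) ** 2


if __name__ == "__main__":
    p = 0.3  # hypothetical probability mass of one sphere
    for n, k_n in [(50, 7), (200, 14), (500, 22)]:  # k_n roughly sqrt(n)
        r = k_n - 1  # "fewer than k_n observations" means S_n <= k_n - 1
        print(n, k_n, binom_lower_tail(n, p, r), feller_bound(n, p, r))
```

With $k_n$ growing like $\sqrt{n}$, so that $k_n/n \to 0$, the bound shrinks as n grows, which is the behavior the proof relies on.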

The  proofs  of  the  following  two  lemmas  follow  from  Lemma  7 
in  the  same  way  that  Lemmas  2 and  3 follow  from  Lemma  1 . 

Lemma 8: Let $X_1,\ldots,X_n$ and $F$ be as in Lemma 2, and let $T_{j,n}$, $1 \le j \le \ell_n$, be as defined for $k_n$-nearest neighbor rules. Then

$$\max_{1\le j\le \ell_n} P\{T_{j,n}/X_1,\ldots,X_n\} \to 0 \quad \text{w.p.1}. \qquad (2.69)$$

Lemma 9: Let $X_1,\ldots,X_n$ and $F$ be as in Lemma 2, and let $X_{(k_n),n}(x)$ be the $k_n$th nearest neighbor to $x \in \mathbb{R}^d$ from $X_1,\ldots,X_n$. Then

$$P\{x \in \mathbb{R}^d : X_{(k_n),n}(x) \to x\} = 1. \qquad (2.70)$$


Theorem 4 proves the asymptotic optimality of $k_n$-nearest neighbor rules in estimation for squared-error loss functions.

Theorem 4: Let $(X,\theta), (X_1,\theta_1), \ldots, (X_n,\theta_n)$ be a sequence of independent identically distributed random vectors, each with distribution $F(x,\theta)$, with each observation $X_i$ taking values in $\mathbb{R}^d$ and each $\theta_i$ taking values in a compact subset of $\mathbb{R}^p$. Let $L(\theta_1,\theta_2)$ be the squared-error loss function. If the distribution of $X$ has a density and there exist versions of $E(\|\theta\|^2/X)$ and $E(\theta/X)$ which are continuous with probability one, $k_n \to \infty$ and $k_n/n \to 0$, then

$$L_n \to R^* \quad \text{in probability}$$

for the $k_n$-nearest neighbor rule.

Proof: The proof is quite similar to the proof of Theorem 3, except that Lemmas 8 and 9 are used instead of Lemmas 5 and 6, and we must show that

$$E(L_n/X_1,\ldots,X_n) \to R^* \quad \text{in probability}. \qquad (2.71)$$

As in the proof of Theorem 3, let $\theta^1_{j,n},\ldots,\theta^{k_n}_{j,n}$ be the parameters associated with the $k_n$ observations in $S_{j,n}$. Then, as before, if we define

$$\bar\theta_{j,n} = \frac{1}{k_n}\sum_{i=1}^{k_n}\theta^i_{j,n}, \qquad (2.72)$$

then $\bar\theta_{j,n}$ is the $k_n$-nearest neighbor rule estimate of $\theta$ for each $x \in T_{j,n}$. As before,

$$E(L_n/X_1,\ldots,X_n) = E\Big[\sum_{j=1}^{\ell_n}\big\{E(\theta^2/X) - 2E(\theta/X)\,E(\bar\theta_{j,n}/S_{j,n}) + E(\bar\theta_{j,n}^2/S_{j,n})\big\}\, I_{[T_{j,n}]}(X)\Big/ X_1,\ldots,X_n\Big]. \qquad (2.73)$$

The same techniques that were used in the proof of Theorem 3 can be used to show that

$$\sum_{j=1}^{\ell_n}\big\{E(\theta^2/X=x) - 2E(\theta/X=x)\,E(\bar\theta_{j,n}/S_{j,n}) + E(\bar\theta_{j,n}^2/S_{j,n})\big\}\, I_{[T_{j,n}]}(x) \to \big[E(\theta^2/X=x) - E^2(\theta/X=x)\big] \quad \text{w.p.1} \qquad (2.74)$$

for each $x$ in a set which has probability one. Then, since $\theta$ takes values in a compact set, the Lebesgue dominated convergence theorem implies that

$$E(L_n/X_1,\ldots,X_n) \to R^* \quad \text{w.p.1}, \qquad (2.75)$$

which completes the proof of Theorem 4.


II.5 Nearest Neighbor Rules With Unequal Weighting

The theorems in the preceding sections of this chapter are concerned with the asymptotic performance of k-nearest neighbor rules which assign equal weight to each of the k-nearest neighbors used in forming the estimate. In some cases the statistician may want to make use of an unequal weighting scheme so that observations which are closer to $X$ contribute more heavily to the estimate of $\theta$ than observations which are farther away. In such a scheme, the estimate $\hat\theta$ would be given by

$$\hat\theta = \sum_{i=1}^{k} a_i\,\theta_{(i)} \qquad (2.76)$$

where $\theta_{(i)}$ is the parameter associated with the observation which is the ith closest observation to $X$ in the data set. Substituting the version given by (2.76) for $\bar\theta_{j,n}$ in equations (2.62), (2.63) and (2.73) and completing the remaining steps in those proofs yields

$$L_n \to \Big(1 + \sum_{i=1}^{k} a_i^2\Big)R^* \quad \text{in probability} \qquad (2.77)$$

where we have assumed $\sum_{i=1}^{k} a_i = 1$ which, of course, we can do without loss of generality. Lagrangian techniques can be employed to show that $\sum_{i=1}^{k} a_i^2$ is minimized, subject to the constraint that $\sum_{i=1}^{k} a_i = 1$, when the $a_i$ are all equal to $1/k$. The conclusion is that large sample performance suffers when unequal weighting schemes are used. This suggests that rather than decrease the weighting of the nearest neighbors which are farther from $X$, the statistician may get better results by using equal weighting with a smaller value for k.
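The effect of the weights on the asymptotic constant can be illustrated directly. The sketch below simply evaluates the factor $1 + \sum a_i^2$ after normalizing the weights to sum to one; the function name and the example weight vectors are hypothetical and chosen only for illustration.

```python
def asymptotic_risk_factor(weights):
    """Return 1 + sum(a_i^2), where a_i are the weights rescaled to sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    a = [w / total for w in weights]
    return 1.0 + sum(ai * ai for ai in a)


if __name__ == "__main__":
    equal = asymptotic_risk_factor([1, 1, 1, 1])   # a_i = 1/4 for k = 4
    linear = asymptotic_risk_factor([4, 3, 2, 1])  # closer neighbors weigh more
    print(equal, linear)
```

For k = 4 the equal weights give the factor 1 + 1/4, while the linearly decreasing weights give a larger factor, in line with the minimization argument above.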


II.6 Remarks


The theorems in this chapter have all required continuity with probability one of $E(\|\theta\|^2/X)$ and $E(\theta/X)$. A function $g$ of a random variable $X$ is continuous with probability one if

$$P\{X \in D\} = 0$$

where $D$ is the set of discontinuity points of $g$. We note that if a joint density $f(x,\theta)$ exists for $F(x,\theta)$, then one version of $E(\theta/X=x)$ is given by

$$E(\theta/X=x) = \frac{\int \theta\, f(x,\theta)\,d\theta}{f(x)} \qquad (2.78)$$

if

$$f(x) = \int f(x,\theta)\,d\theta > 0. \qquad (2.79)$$

(See, for example, Breiman [7], Ch. 4.) Similarly, when (2.79) holds,

$$E(\|\theta\|^2/X=x) = \frac{\int \|\theta\|^2 f(x,\theta)\,d\theta}{f(x)}. \qquad (2.80)$$

From (2.78) and (2.80) it can be seen that if $f(x,\theta)$ is $\mu$-almost everywhere continuous in $x$, $\mu$ denoting Lebesgue measure on $\mathbb{R}^d$, then the versions of $E(\theta/X)$ and $E(\|\theta\|^2/X)$ given by (2.78) and (2.80) are continuous with probability one.

The proof of Theorem 1 for p-dimensional parameter spaces is the same in all major respects, the only minor difficulty being the question of the convergence of

$$\sum_{j=1}^{n} E\{\|\theta-\theta_j\|^2/X=x,X_j\}\, I_{[V_{j,n}]}(x).$$

To establish Theorem 1 it will be sufficient to show that if $X'_n(x) \to x$ with probability one, then

$$\sum_{j=1}^{n} E\{\|\theta-\theta_j\|^2/X=x,X_j\}\, I_{[V_{j,n}]}(x) \to 2\big\{E(\|\theta\|^2/X=x) - \|E(\theta/X=x)\|^2\big\} \quad \text{w.p.1}. \qquad (2.81)$$

Expanding the expectation on the left side of (2.81),

$$E\{\|\theta-\theta_j\|^2/X=x,X_j\} = E\{\|\theta\|^2/X=x\} - 2E\{(\theta,\theta_j)/X=x,X_j\} + E\{\|\theta_j\|^2/X_j\}. \qquad (2.82)$$

In order to prove (2.81), we first show

$$\sum_{j=1}^{n} E\{(\theta,\theta_j)/X=x,X_j\}\, I_{[V_{j,n}]}(x) \to \|E(\theta/X=x)\|^2. \qquad (2.83)$$

The left side of (2.83) is equal to

$$\sum_{j=1}^{n}\sum_{i=1}^{p} E(\theta^i/X=x)\,E(\theta^i_j/X_j)\, I_{[V_{j,n}]}(x) = \sum_{i=1}^{p} E(\theta^i/X=x)\sum_{j=1}^{n} E(\theta^i_j/X_j)\, I_{[V_{j,n}]}(x), \qquad (2.84)$$

where $\theta^i$ and $\theta^i_j$ are the ith components of $\theta$ and $\theta_j$ respectively. Since $E(\theta/X)$ is assumed continuous with probability one, and $X'_n(x) \to x$ with probability one, (2.84) can be seen to converge to

$$\sum_{i=1}^{p} E^2(\theta^i/X=x) \qquad (2.85)$$

with probability one. But (2.85) is equal to the right-hand side of (2.83), as we intended to prove.

Noting that $E\{\|\theta\|^2/X\}$ was assumed continuous with probability one, the convergence of $X'_n(x) \to x$ with probability one yields

$$\sum_{j=1}^{n} E\{\|\theta_j\|^2/X_j\}\, I_{[V_{j,n}]}(x) \to E\{\|\theta\|^2/X=x\} \quad \text{w.p.1}. \qquad (2.86)$$

The desired convergence in (2.81) follows immediately from (2.82), (2.83), and (2.86). The remainder of the proof of Theorem 1 for p > 1 uses predominantly the same techniques as the proof for p = 1. For k- and $k_n$-nearest neighbor rules, the same remarks will still hold, but the notation and algebra become more involved since we must now deal with the average of the parameters of the nearest neighbors instead of just $\theta_j$.

The  proofs  in  this  chapter  have  employed  the  condition  that  the 
distribution  of  X have  a density.  This  condition  can  easily  be  weakened 
to  the  requirement  that  the  distribution  for  X be  continuous . This  permits 
the  distribution  of  X to  have  a singular  continuous  part. 

The proof of Theorem 2 contains the tacit assumption that a Bayes rule exists, which may not always be the case. However, for any $\epsilon > 0$ we can find a rule which has risk less than $R^* + \epsilon$. Such rules are known as $\epsilon$-Bayes rules (see, for example, Ferguson [15]). Let $\epsilon > 0$ be given, and let $\theta'(x)$ be a rule with risk $R' \le R^* + \epsilon$. Then, the techniques used in the proof of Theorem 2 can be used to show that

$$P\{L_n - 2R' \ge \epsilon\} \to 0. \qquad (2.87)$$

But $\{L_n - 2R' \ge \epsilon\} \supset \{L_n - 2R^* \ge 3\epsilon\}$, so that (2.87) implies

$$P\{L_n - 2R^* \ge 3\epsilon\} \to 0.$$

Hence Theorem 2 is true whether or not a Bayes rule exists.

As a final remark concerning the convergence of $L_n$, we will give an example concerning the rate at which $P\{|L_n - R| \ge \epsilon\}$ goes to zero, where $R$ is $(1 + 1/k)R^*$. Consider the case where $\theta \in \{1,2,\ldots,\ell\}$ and $P\{\theta = i\} = 1/\ell$, $1 \le i \le \ell$. Let $f(x/\theta)$ be defined as

$$f(x/\theta=i) = \begin{cases} 1, & i - \tfrac{1}{2} < x \le i + \tfrac{1}{2} \\ 0, & \text{otherwise.} \end{cases}$$

Then $R^* = 0$ since given $X$, the value of $\theta$ is determined. Hence $R = (1 + 1/k)R^* = 0$. But for any fixed n, we can increase $\ell$ to spread out the distribution so that when $\ell > n$, $L_n > (\ell - n)/\ell > \epsilon$ for $\ell$ sufficiently large. We can conclude that the rate of convergence of $L_n$ can be arbitrarily slow in the sense that for any n and $\epsilon > 0$, there exist distributions $F(x,\theta)$ for which $P\{|L_n - R| > \epsilon\} = 1$. This example can easily be modified so that $\theta$ is a continuous parameter by allowing $\theta$ to take values in small intervals surrounding $1, 2, \ldots, \ell$. $R^*$ is then no longer zero, but can be made arbitrarily small by appropriately choosing the size of the intervals for $\theta$. The rest of the example follows easily.


III.  THE  EVALUATION  OF  FINITE  SAMPLE  PERFORMANCE
FOR  AN  ESTIMATION  RULE

III.1 Motivation for Chapter III

The evaluation of the finite sample performance of an estimation rule is the process of estimating the value of $L_n$, the expected risk conditioned on a data set containing n observations. The problem is analogous to error estimation for discrimination rules. Toussaint [5] has compiled a bibliography of papers on error estimation in which he includes a brief discussion of the most commonly employed techniques. Most of these techniques also find application in estimating $L_n$ for estimation rules.

In this chapter we will be concerned with finding an upper bound for

$$P\{|L_n - \hat L_n| \ge \epsilon\} \qquad (3.1)$$

where $\hat L_n$ denotes the estimate of $L_n$. In the preceding chapter, we were concerned merely with the question of the convergence of $L_n$ without bothering to investigate the rate of such convergence. Indeed, as was pointed out at the close of Chapter II, the rate of convergence of $L_n$ can be arbitrarily slow, depending on the structure of $F(x,\theta)$. In this chapter, the rate at which (3.1) goes to zero will be of great importance, the reason being that if a minimal rate can be established which is independent of $F(x,\theta)$, it will enable the statistician to determine for a given n and $\epsilon$ the value of $\delta(n,\epsilon)$ for which

$$P\{|L_n - \hat L_n| \ge \epsilon\} \le \delta(n,\epsilon). \qquad (3.2)$$

Such a bound is termed distribution-free and it is quite useful to the statistician since it enables him to determine how much confidence he can place in his estimate of $L_n$.

As in the preceding chapter, we let $D_n = ((X_1,\theta_1),\ldots,(X_n,\theta_n))$ be a sequence of independent identically distributed random vectors, each with distribution $F(x,\theta)$, where each observation $X_i$ takes values in $\mathbb{R}^d$ and each parameter $\theta_i$ takes values in a parameter space $\mathbb{R}^p$. Let $(X,\theta)$ be independent of the data and distributed with $F(x,\theta)$. In this chapter, we will be concerned with a class of estimation rules somewhat broader than the k-nearest neighbor rules considered in Chapter II.

We will say that a randomized estimation rule is one which produces an estimate $\hat\theta$ which is a random vector chosen from a distribution which depends on $X$ and $D_n$. We will assume that the rule is described by a jointly measurable function

$$\delta:\ \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R}^p)^n \times \mathbb{R}^p \to [0,1] \qquad (3.3)$$

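where, for each $X$ and $D_n$, $\delta_{X,D_n}$ is a distribution function on $\mathbb{R}^p$. We can define the joint distribution of $\hat\theta$, $(X,\theta)$, and $D_n$ by

$$P[D_n \in A,\ (X,\theta) \in B,\ \hat\theta \in C] = \int_A\int_B\int_C d\delta_{x,D_n}(\hat\theta)\,dF(x,\theta)\,dF^{n}(D_n)$$

where $A$, $B$, and $C$ are Borel sets in $(\mathbb{R}^d \times \mathbb{R}^p)^n$, $(\mathbb{R}^d \times \mathbb{R}^p)$, and $\mathbb{R}^p$ respectively.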

Actually, a statistician may use a sequence $\{\delta_n\}$ of such randomized estimation rules whose properties will vary with n, the amount of data available. For the purposes of this chapter, however, we will only be concerned with the properties of the particular rule used with $D_n$, rather than with the whole sequence. It will be shown that if the rule is one of a certain class of rules to be described, then bounds of the type (3.2) can be given for two particular estimates of $L_n$.

The following terminology will be used to describe the rules in which we will be interested. A rule will be called local with parameter r if the distribution $\delta_{X,D_n}$ is, with probability one, a function only of $X$ and $(X_{(1)},\theta_{(1)}),\ldots,(X_{(r)},\theta_{(r)})$, where $X_{(i)}$ is the ith closest observation to $X$ from $D_n$. For example, let

$$D_{n,i} = ((X_1,\theta_1),\ldots,(X_{i-1},\theta_{i-1}),(X_{i+1},\theta_{i+1}),\ldots,(X_n,\theta_n)). \qquad (3.4)$$

Then, if $\delta$ is local with parameter $r < n-1$,

$$\delta_{X,D_n}(\hat\theta) = \delta_{X,D_{n,i}}(\hat\theta)$$

for all $\hat\theta \in \mathbb{R}^p$ whenever $X_i$ is not one of the r-nearest neighbors to $X$ from $D_n$. The intent here is simply to point out that, since $\delta$ depends only on $X$ and its r-nearest neighbors, any changes made in the data which do not affect the r-nearest neighbors also do not affect the distribution of $\hat\theta$. Finally, we will call an estimation rule symmetric if, for any given $X$, permutations of the order of the elements of $D_n$ do not affect $\delta_{X,D_n}$.

An example of a local estimation rule which is symmetric is the k-nearest neighbor rule when $F(x,\theta)$ is nonatomic. Note however, that when $F(x,\theta)$ possesses atoms, certain difficulties may arise. In some cases there may be positive probability that several observations in the data are at the same distance from $X$ as its kth-nearest neighbor. In order to have a local rule then, it is necessary to devise some method to prevent some of these observations from contributing to the estimate. This problem will be discussed in more detail in the remarks at the close of this chapter.

The results to be presented are for two different estimates of $L_n$, which will be referred to as the deleted estimate and the holdout estimate. (Toussaint provides references for previous work on both of these estimates. In Toussaint's paper, the deleted estimate is called the U method, and the holdout estimate is called the H method.) The deleted estimate, $\hat L_n$, is defined as

$$\hat L_n = \frac{1}{n}\sum_{i=1}^{n} L(\theta_i,\hat\theta_i) \qquad (3.5)$$

where $\hat\theta_i$ is the estimate produced by $\delta$ when $X_i$ is the observation and $D_{n,i}$ as defined in (3.4) is the data. The deleted estimate, then, is just the average loss incurred when each $(X_i,\theta_i)$ in turn is deleted from the data set and $\delta$ is used with $X_i$ and the remaining n-1 elements of $D_n$ to estimate $\theta_i$.

The holdout estimate, $L'_n$, is defined as

$$L'_n = \frac{1}{\ell}\sum_{i=n-\ell+1}^{n} L(\theta_i,\theta_i^*), \tag{3.6}$$

where, for $n-\ell < i \le n$, $\theta_i^*$ is the estimate produced by $\delta$ with observation
$X_i$ and data $D_{n-\ell}$, where

$$D_{n-\ell} = \big((X_1,\theta_1),\ldots,(X_{n-\ell},\theta_{n-\ell})\big). \tag{3.7}$$

The holdout estimate is the average loss when the first $n-\ell$ elements of
$D_n$ are used to estimate the parameters for the remaining $\ell$ observations.
The first $n-\ell$ elements of $D_n$ are usually called the training set, and the
remaining elements are known as the test set. In the results presented
in section III.2, the size of the test set, $\ell$, will be allowed to increase
with $n$.
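The two definitions above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the observations are taken to be scalar, the rule is taken to be the plain k-nearest neighbor average with squared-error loss, ties in distance are broken by data order, and the function names (`knn_estimate`, `deleted_estimate`, `holdout_estimate`) are hypothetical rather than taken from the source.

```python
import random

def knn_estimate(x, data, k):
    """Average the parameters of the k observations in `data` closest to x.
    Ties in distance are resolved by data order (sorted() is stable)."""
    nearest = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(theta for _, theta in nearest) / k

def squared_error(theta, theta_hat):
    return (theta - theta_hat) ** 2

def deleted_estimate(data, k, loss=squared_error):
    """(3.5): delete each pair in turn and estimate its parameter from the rest."""
    total = 0.0
    for i, (x_i, theta_i) in enumerate(data):
        d_ni = data[:i] + data[i + 1:]            # D_{n,i} of (3.4)
        total += loss(theta_i, knn_estimate(x_i, d_ni, k))
    return total / len(data)

def holdout_estimate(data, k, ell, loss=squared_error):
    """(3.6): train on the first n - ell pairs, test on the last ell pairs."""
    train, test = data[:-ell], data[-ell:]        # D_{n-ell} of (3.7) and the test set
    return sum(loss(t, knn_estimate(x, train, k)) for x, t in test) / ell

if __name__ == "__main__":
    # Illustrative data only: theta uniform on [0,1], X = theta + unit gaussian noise.
    rng = random.Random(0)
    thetas = [rng.random() for _ in range(100)]
    sample = [(t + rng.gauss(0.0, 1.0), t) for t in thetas]
    print(deleted_estimate(sample, k=5), holdout_estimate(sample, k=5, ell=10))
```

The deleted estimate calls the rule n times on n-1 points each, while the holdout estimate calls it only $\ell$ times, which is the computational trade-off discussed in the remarks at the end of this chapter.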

III. 2 Distribution-Free  Bounds  for  Deleted  and  Holdout  Estimates 
Theorem 5 will establish a distribution-free bound on
$P\{|L_n - \hat L_n| \ge \epsilon\}$, while Theorem 6 establishes a similar bound on
$P\{|L_n - L'_n| \ge \epsilon\}$. We assume that $(X,\theta)$ and the data set $D_n$ are as
defined previously, and let $(X_0,\theta_0)$ be independent of $(X,\theta)$ and the
data and distributed as $F(x,\theta)$. The result given by Theorem 5 is similar
in nature to a result of Rogers and Wagner [9] for deleted estimates in
discrimination problems.

Theorem 5: Let $\delta$ be a symmetric estimation rule which is local with
parameter $k$. Let $L$ be a loss function which satisfies

$$\sup_{\theta,\hat\theta} L(\theta,\hat\theta) = M < \infty .$$

Then $P\{|L_n - \hat L_n| \ge \epsilon\} \le$


M2  l+6k  ki 

2 n 2 
e L n 


(n-1)' 


(3.8) 


(3.9) 


Proof: By Chebyshev's inequality,

$$P\{|L_n - \hat L_n| \ge \epsilon\} \le E(L_n - \hat L_n)^2/\epsilon^2 = \big(EL_n^2 - 2EL_n\hat L_n + E\hat L_n^2\big)/\epsilon^2 . \tag{3.10}$$


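Let $\hat\theta$, $\hat\theta_0$, and $\hat\theta_i$ be independent random vectors where $\hat\theta$ has distribution
$\delta_{X,D_n}$, $\hat\theta_0$ has distribution $\delta_{X_0,D_n}$, and, for each $1 \le i \le n$, $\hat\theta_i$ has
distribution $\delta_{X_i,D_{n,i}}$. Then

$$E(L_n^2) = E\big[E^2\big(L(\theta,\hat\theta)/D_n\big)\big] \tag{3.11}$$

$$= E\big[E\big(L(\theta,\hat\theta)/D_n\big)\,E\big(L(\theta_0,\hat\theta_0)/D_n\big)\big]$$

$$= E\big[L(\theta,\hat\theta)L(\theta_0,\hat\theta_0)\big] . \tag{3.12}$$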


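$$E(L_n\hat L_n) = E\big\{E[L(\theta,\hat\theta)/D_n]\,\hat L_n\big\}$$

$$= \frac{1}{n}\sum_{i=1}^{n} E\big\{L(\theta_i,\hat\theta_i)\,E[L(\theta,\hat\theta)/D_n]\big\}$$

$$= \frac{1}{n}\sum_{i=1}^{n} E\big\{E[L(\theta_i,\hat\theta_i)L(\theta,\hat\theta)/D_n]\big\}$$

$$= E\big[L(\theta_1,\hat\theta_1)L(\theta,\hat\theta)\big] . \tag{3.14}$$

The last step above was made possible by the assumed symmetry of $\delta$,
which implies that

$$E\big\{E[L(\theta_i,\hat\theta_i)L(\theta,\hat\theta)/D_n]\big\} = E\big\{E[L(\theta_j,\hat\theta_j)L(\theta,\hat\theta)/D_n]\big\} \tag{3.15}$$

for all $i, j$. Finally,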

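$$E(\hat L_n^2) = \frac{1}{n^2}\,E\Big\{\sum_{i} L^2(\theta_i,\hat\theta_i) + \sum_{i\ne j} L(\theta_i,\hat\theta_i)L(\theta_j,\hat\theta_j)\Big\}$$

$$= \frac{1}{n^2}\Big\{\sum_{i} E[L^2(\theta_i,\hat\theta_i)] + \sum_{i\ne j} E[L(\theta_i,\hat\theta_i)L(\theta_j,\hat\theta_j)]\Big\}$$

$$= \frac{1}{n}\Big\{E[L^2(\theta_1,\hat\theta_1)] + (n-1)\,E[L(\theta_1,\hat\theta_1)L(\theta_2,\hat\theta_2)]\Big\}, \tag{3.16}$$

where the last step is due to symmetry. Combining (3.12), (3.14), and
(3.16) above, we have

$$E(L_n - \hat L_n)^2 = E[L(\theta,\hat\theta)L(\theta_0,\hat\theta_0)] - 2E[L(\theta_1,\hat\theta_1)L(\theta,\hat\theta)] + E[L(\theta_1,\hat\theta_1)L(\theta_2,\hat\theta_2)]$$

$$\qquad + \frac{1}{n}\Big\{E[L^2(\theta_1,\hat\theta_1)] - E[L(\theta_1,\hat\theta_1)L(\theta_2,\hat\theta_2)]\Big\} . \tag{3.17}$$

Let $D'_n = ((X_0,\theta_0),(X_2,\theta_2),\ldots,(X_n,\theta_n))$ and let $\hat\theta'$ and $\hat\theta'_0$ be
independent random vectors which are also independent of $\hat\theta$, $\hat\theta_0$, and
$\hat\theta_1$ conditioned on $X$, $(X_0,\theta_0)$, and $D_n$. Let $\hat\theta'$ be distributed with
$\delta_{X,D'_n}$ and $\hat\theta'_0$ with $\delta_{X_0,D_{n,1}}$. Then it is not difficult to see that

$$E[L(\theta,\hat\theta)L(\theta_0,\hat\theta_0)] - E[L(\theta,\hat\theta)L(\theta_1,\hat\theta_1)] = E[L(\theta,\hat\theta)L(\theta_0,\hat\theta_0)] - E[L(\theta,\hat\theta')L(\theta_0,\hat\theta'_0)] . \tag{3.18}$$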


Now, under the hypothesis that $X$, $(X_0,\theta_0)$, and $D_n$ are such that
$\delta_{X,D_n} = \delta_{X,D'_n}$ and $\delta_{X_0,D_n} = \delta_{X_0,D_{n,1}}$, (3.18) is equal to zero since
it is the expectation of the difference of two identically distributed random
variables. Hence (3.18) can be bounded from above by

$$M^2\,E\big\{I[\delta_{X,D_n} \ne \delta_{X,D'_n}] + I[\delta_{X_0,D_n} \ne \delta_{X_0,D_{n,1}}]\big\} = M^2\big[P\{\delta_{X,D_n} \ne \delta_{X,D'_n}\} + P\{\delta_{X_0,D_n} \ne \delta_{X_0,D_{n,1}}\}\big] . \tag{3.19}$$


Next,  we  let  D‘  = (X,  0) , (X.  , 0 ),...,  (X  , 0 ) , and  let  0"  and  ©" 
n , i o 6 n n 1 

be  conditionally  independent  random  vectors  given  X and  D , and  let 

n 

0"  have  distribution  6 and  0*'  have  distribution  6V  . Then, 

n f z in 

using  the  same  techniques,  we  can  show  that 


E[L(01,01)L(02,02)]  - E[L(0,0)L(01,01)] 
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,M[Pt6  a )+P!6  D Ac  n.  }].  P-20) 

n,/  ln,i  1 n,  1 


n 


The remaining term of (3.17) can be bounded as follows:

$$\frac{1}{n}\Big\{E[L^2(\theta_1,\hat\theta_1)] - E[L(\theta_1,\hat\theta_1)L(\theta_2,\hat\theta_2)]\Big\} \le M^2/n .$$

Now, since $\delta$ is a local rule,

2 

Pf6X,Dn=6X,D^  2 (~) 


(3.21) 


(3.22) 


since $\delta_{X,D_n}$ is not changed if $X_1$ is not one of the $k$-nearest neighbors
to $X$ from $D_n$, and $X_0$ is not closer to $X$ than its $k$th-nearest neighbor
from $D_n$. From (3.22),


Pf6X,D /6X,D'  ) S 1 " (V)  * 

n n 


Similarly, 


(3.23a) 


Pl\■D/‘X0■Dn,l, 


P(5X,d/8X,D  A 
n n,  L 

P{6X  ,D  /Sxi/Dn  i5  * 1 " ( n_1  ) ‘ 

i n f l 1 n f l 


(3.23b) 


(3.23c) 


(3.23d) 


Combining  (3.19),  (3.20),  (3.21),  and  (3.23),  we  arrive  at  the  statement 
of  the  theorem. 

For the holdout estimate, $L'_n$, we have the following theorem.

Theorem 6: Let $\delta$ be a symmetric estimation rule which is local with
parameter $k$. Let $L$ be a loss function which satisfies

$$\sup_{\theta,\hat\theta} L(\theta,\hat\theta) = M < \infty .$$


Then 


where $\ell$ is the number of samples held out.


(3.24) 


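Proof: Let $L_{n-\ell} = E[L(\theta,\hat\theta')/D_n]$, where $\hat\theta'$ is an independently chosen
random vector with distribution $\delta_{X,D_{n-\ell}}$ and

$$D_{n-\ell} = \big((X_1,\theta_1),\ldots,(X_{n-\ell},\theta_{n-\ell})\big) .$$

Then,

$$P\{|L_n - L'_n| \ge \epsilon\} \le P\{|L_n - L_{n-\ell}| \ge \epsilon/2\} + P\{|L_{n-\ell} - L'_n| \ge \epsilon/2\} . \tag{3.25}$$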

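The first term on the right-hand side of (3.25) is bounded by Markov's
inequality:

$$P\{|L_n - L_{n-\ell}| \ge \epsilon/2\} \le \frac{2}{\epsilon}\,E|L_n - L_{n-\ell}| = \frac{2}{\epsilon}\,E\big|E\big(L(\theta,\hat\theta) - L(\theta,\hat\theta')/D_n\big)\big| . \tag{3.26}$$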

By the same argument used in the proof of Theorem 5, (3.26) can be
bounded by

$$\frac{2M}{\epsilon}\,E\,I[\delta_{X,D_n} \ne \delta_{X,D_{n-\ell}}] = \frac{2M}{\epsilon}\,P\{\delta_{X,D_n} \ne \delta_{X,D_{n-\ell}}\} . \tag{3.27}$$

For  the  second  term, 


Note  that  for  each  i, 

E ( L ( 6 8’  ,)/D  ) = L . 

' n-i  n-i  n / n- 1 


Then,  by  Hoeffding’s  inequality  (10], 

p!|L  -V  I |s2e-*,2/4M. 
n-jt  n1  2 ) 


(3.28) 


The inequality in (3.24) now follows from (3.25), (3.27), and (3.28).

For the k-local rules, $P\{\delta_{X,D_n} \ne \delta_{X,D_{n-\ell}}\}$ is bounded by the
probability that not all of the k-nearest neighbors to $X$ are in the first
$n-\ell$ elements of $D_n$. Hence


P{9(n)  ^ e(n-jt) } si- 


(V) 

(2) 


£ 1 


-Rf- 


This  proves  the  theorem. 
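The probability used in the last step can be evaluated directly. The sketch below assumes that ties in distance occur with probability zero, so that by exchangeability of the data every set of k indices is equally likely to be the set of k nearest neighbors of $X$; the chance that all k fall among the first $n-\ell$ elements is then $\binom{n-\ell}{k}/\binom{n}{k}$. The function name is hypothetical.

```python
from math import comb

def prob_neighbor_held_out(n, ell, k):
    """1 - C(n-ell, k) / C(n, k): chance that at least one of the k nearest
    neighbors of X lies among the ell held-out observations (no ties assumed)."""
    return 1.0 - comb(n - ell, k) / comb(n, k)

if __name__ == "__main__":
    # Compare with the cruder union bound k * ell / n for n = 400, ell = 20.
    for k in (1, 3, 5, 7, 9):
        print(k, prob_neighbor_held_out(400, 20, k), k * 20 / 400)
```

The union bound $k\ell/n$ is never smaller than the exact value, and both tend to zero only when $k\ell/n$ does, which is why $\ell$ cannot grow too quickly with n.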
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III.  3 Remarks 

Noting that both of the theorems in this chapter give bounds
which depend on $M = \sup_{\theta,\hat\theta} L(\theta,\hat\theta)$, some justification would appear to
be in order. Consider the following example. Let the observations
$X, X_1,\ldots,X_n$ take values in $\mathbb{R}$, and let $\Theta = [0,1]$, $\Theta' = [0,m]$ where
$m > 1$. Suppose that $(X,\theta),(X_1,\theta_1),\ldots,(X_n,\theta_n)$ are independent and
identically distributed on $\mathbb{R}\times\Theta$ with distribution $F(x,\theta)$. Also, suppose
that $(X,\theta'),(X_1,\theta'_1),\ldots,(X_n,\theta'_n)$ are independent and identically
distributed on $\mathbb{R}\times\Theta'$ with distribution $F'(x,\theta') = F(x,\theta'/m)$. Let
$L(\theta,\hat\theta) = |\theta-\hat\theta|$ be the loss function for both parameter spaces. By this
construction, it is not difficult to show that

$$P\{L_n \le x\} = P\{L'_n \le mx\} \tag{3.27}$$

and

$$P\{\hat L_n \le x\} = P\{\hat L'_n \le mx\} \tag{3.28}$$

where $L_n$ and $\hat L_n$ are associated with $\mathbb{R}\times\Theta$ and $L'_n$ and $\hat L'_n$ are associated
with $\mathbb{R}\times\Theta'$. To see this, note that $\mathbb{R}\times\Theta'$ is just $\mathbb{R}\times\Theta$ with a scale factor
added to $\Theta$. Let $f$ be the obvious one-one mapping from $\mathbb{R}\times\Theta$ onto $\mathbb{R}\times\Theta'$.
Now, if $P$ is the probability measure on $\mathbb{R}\times\Theta$ corresponding to $F$, and $P'$
on $\mathbb{R}\times\Theta'$ corresponds to $F'$, then $P\{A\} = P'\{f(A)\}$ for all measurable
$A \subset \mathbb{R}\times\Theta$. From this, and the fact that the same loss function is used
on $\Theta$ and $\Theta'$, (3.27) and (3.28) follow easily. From (3.27) and (3.28),

$$E(L_n - \hat L_n)^2 = \frac{1}{m^2}\,E(L'_n - \hat L'_n)^2 . \tag{3.29}$$

We can conclude that any distribution-free bound on $E(L_n - \hat L_n)^2$ must
depend on $M^2$. Similarly, a bound on $E|L_n - \hat L_n|$ must depend on $M$,
if it is to be distribution-free.

Finally, we note that if $F(x,\theta)$ possesses atoms, there may be
positive probability that more than one observation is at the same distance
from $X$ as its $k$th-nearest neighbor. In order to preserve the local nature
of the rule, it will be necessary to break the tie in distance in order
to use only k neighbors in the estimate of $\theta$. This may be done by
generating a sequence of random variables $Z_1,\ldots,Z_n$ which are
independent and identically distributed uniformly on the interval (0,1).
Ties in distance can then be broken by choosing from among those
tied, the observation (or observations, if necessary) which has the
smallest $Z_i$ associated with it.
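The tie-breaking device just described might be sketched as follows. The sketch assumes scalar observations and absolute distance, and the helper names (`attach_tiebreakers`, `k_nearest`) are hypothetical. Each observation receives its own $Z_i$ once, when the data set is formed, so the same auxiliary variable is used for every query point.

```python
import random

def attach_tiebreakers(data, rng=random):
    """Pair each observation (x_i, theta_i) with its own Z_i ~ Uniform(0, 1)."""
    return [(x, theta, rng.random()) for x, theta in data]

def k_nearest(x, tagged, k):
    """Exactly k observations nearest to x; among observations tied in
    distance, those with the smaller Z_i are chosen first."""
    ranked = sorted(tagged, key=lambda obs: (abs(obs[0] - x), obs[2]))
    return [(xi, theta) for xi, theta, _ in ranked[:k]]

if __name__ == "__main__":
    # Three observations sit on the same atom x = 1.0; only one of them can be
    # the second nearest neighbor of the query point 0.0.
    data = [(1.0, 10.0), (1.0, 20.0), (1.0, 30.0), (0.5, 40.0), (5.0, 50.0)]
    tagged = attach_tiebreakers(data, random.Random(0))
    print(k_nearest(0.0, tagged, 2))
```

Sorting on the pair (distance, $Z_i$) keeps the rule local, since exactly k observations are returned, and symmetric, since the outcome does not depend on the order in which the data are stored.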

In cases where $F(x,\theta)$ is known to be atomic, a reasonable
estimation rule would attempt to use only observations which lie on
the same atom as $X$ in forming its estimate of $\theta$. This is because the
conditional distribution of the parameters associated with one atom
may be totally distinct from the conditional distribution of the parameters
associated with any other atom. It is still necessary to break ties in
distance since more than k observations from the data may lie on the
same atom as $X$.

The bounds presented in Theorems 5 and 6 are limited in the sense
that fairly large values of n are necessary to produce useful results,
especially for the holdout estimate. The value of $\ell_n$ which minimizes
the bound in Theorem 6 is a complicated function of n, $\epsilon$, M and k;
however, the optimum value is generally near $\sqrt{n}$. This implies that
the best rate of decrease for the bound on the holdout estimate is on the
order of $1/\sqrt{n}$. This is a factor of $\sqrt{n}$ slower than the bound on the
deleted estimate. On the other hand, the amount of computation required
to obtain the holdout estimate with $\ell_n = \sqrt{n}$ is only about $1/\sqrt{n}$
times the computation required to obtain the deleted estimate.
This suggests that the holdout estimate would only be practical in
cases where the size of the data set is so large that the holdout bounds
are useful and the deleted estimate is too costly.


IV.  COMPUTER  SIMULATION  RESULTS 


IV.  1 Remarks  Concerning  the  Experiments 

The  bounds  established  by  Theorems  5 and  6 are,  by  their  very 
nature,  extremely  general.  While  this  generality  is  a virtue  for  reasons 
already  mentioned,  it  also  can  be  expected  to  result  in  rather  loose 
bounds for a great many choices of the underlying distribution $F(x,\theta)$.

In  order  to  gain  some  feel  for  the  performance  of  the  deleted  and  holdout 
estimates  in  specific  examples,  some  computer  simulation  experiments 
were  performed. 

The experiments are in the form of Monte Carlo studies, performed
in the following fashion. For a particular distribution F, n independent
pseudo-random vectors $((X_1,\theta_1),\ldots,(X_n,\theta_n))$ were generated having
that distribution. Based on the generated data, $L_n$ and its deleted and
holdout estimates were computed and compared for k = 1, 3, 5, 7 and 9
nearest neighbor rules. This process was repeated 1200 times for
various values of n. For certain values of $\epsilon$, the relative frequency of
the event $\{|L_n - \hat L_n| > \epsilon\}$ was returned as an estimate of $P\{|L_n - \hat L_n| > \epsilon\}$.
The number of runs, 1200, was selected to guarantee that the estimates
produced would be within .03 of the correct value 95% of the time.
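The text does not say how the figure of 1200 runs was arrived at. One plausible reading, offered here only as an assumption, is the usual normal approximation for a relative frequency, taking the worst-case binomial variance p(1-p) = 1/4. The function name below is hypothetical.

```python
import math

def worst_case_half_width(runs, z=1.96):
    """Normal-approximation 95% half-width for a relative frequency,
    using the worst case p = 1/2 for the binomial variance p(1-p)/runs."""
    return z * math.sqrt(0.25 / runs)

if __name__ == "__main__":
    # Half-width for the 1200 runs used in the experiments.
    print(worst_case_half_width(1200))
```

Under this reading the half-width for 1200 runs falls just below .03, which is consistent with the accuracy claimed above.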

The experiments were performed for three specific distributions.
The first is actually a discrimination problem where $\theta$ takes only
the values 0 or 1 with equal probability. The conditional densities
$f(x/\theta=1)$ and $f(x/\theta=0)$ are the triangular densities for which Cover
and Hart calculated $EL_n$ and its limit. This example is used predominantly
because Cover and Hart's results are quite well known. The second
example is also a discrimination problem. In this case, $\theta$ again takes
values 0 and 1 with equal probability, but $f(x/\theta)$ is now gaussian with
mean $\theta$ and variance 1. This example is included mainly for comparison
purposes with the third example, which is an estimation problem with
$\theta$ uniformly distributed on the interval [0,1], while $f(x/\theta)$ is again
gaussian with mean $\theta$ and variance 1.


IV. 2 Example  1:  Discrimination  with  Triangular  Densities 


The triangular densities of Cover and Hart are shown in Fig. 1.
For these densities and the discrimination version of the single-nearest
neighbor rule, with the 0-1 loss function defined by

$$L(\theta,\hat\theta) = \begin{cases} 0, & \text{if } \theta = \hat\theta \\ 1, & \text{otherwise,} \end{cases}$$

they calculated


$$EL_n = \frac{1}{3} + \frac{3n+5}{2(n+1)(n+2)(n+3)}$$


$$R^* = \frac{1}{4} .$$


In Table 1 we have shown the average value of $L_n$ obtained for k = 1, 3, 5,
7, and 9 nearest neighbors, and n = 20, 50, 100, 200, 400. Note that
the average of $L_n$ seems to approach its limiting value quite early.

Figure 1. Conditional Densities for Example 1 (curves $f(x/\theta=1)$ and $f(x/\theta=0)$)

Table 1. Experimentally Obtained Average Values of $L_n$ for Example 1

        n=20    n=50    n=100   n=200   n=400
k=1     .3364   .3337   .3337   .3331   .3335
k=3     .3055   .3010   .3001   .3002   .3000
k=5     .2936   .2867   .2858   .2853   .2860
k=7     .2911   .2783   .2777   .2773   .2778
k=9     .2933   .2736   .2726   .2722   .2728

If we let R(k) denote the limit of $EL_n$ for the k-nearest neighbor rule,
then Cover and Hart have shown, for discrimination problems,

$$R^* \le R(k) \le (1 + 1/k)R^* . \tag{4.1}$$

Substituting the values of $R^*$ and k used for Example 1 into (4.1) we have

.25 $\le$ R(1) $\le$ .5      (4.2a)
.25 $\le$ R(3) $\le$ .333    (4.2b)
.25 $\le$ R(5) $\le$ .300    (4.2c)
.25 $\le$ R(7) $\le$ .286    (4.2d)
.25 $\le$ R(9) $\le$ .278    (4.2e)

The values in Table 1 satisfy the bounds given by (4.2) quite easily.
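The arithmetic behind (4.2) is simple enough to check mechanically. The sketch below evaluates the upper limit $(1 + 1/k)R^*$ of (4.1) for $R^* = .25$ and compares it with the n = 400 averages reported for Example 1; the function and variable names are hypothetical.

```python
def knn_risk_upper_bounds(r_star, ks=(1, 3, 5, 7, 9)):
    """Upper limits (1 + 1/k) R* from (4.1), one per k."""
    return {k: (1 + 1 / k) * r_star for k in ks}

# Averages of L_n at n = 400 as reported for Example 1.
table1_n400 = {1: 0.3335, 3: 0.3000, 5: 0.2860, 7: 0.2778, 9: 0.2728}

if __name__ == "__main__":
    bounds = knn_risk_upper_bounds(0.25)
    for k, upper in bounds.items():
        inside = 0.25 <= table1_n400[k] <= upper
        print(k, round(upper, 3), table1_n400[k], inside)
```

The margin between the observed average and the upper limit shrinks as k grows, since the interval $[R^*, (1+1/k)R^*]$ itself narrows.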

In Figs. 2-6 we have shown, for the same values of k, graphs
of $P\{|L_n - \hat L_n| > \epsilon\}$ versus n for $\epsilon$ = .025, .05, and .1. These graphs
are for the deleted estimate. The curves for the various values of k
exhibit somewhat similar behavior, decreasing quite slowly for the
smaller values of $\epsilon$, and much more quickly for larger values.

The theoretical results of Rogers and Wagner for discrimination,
and the results shown here in Chapter III for estimation, present bounds
which deteriorate as k increases. The deterioration of the performance
of the deleted estimate as k increases is not in evidence in Figs. 2-6,
even for the case where k = 9 and n is twenty. The theoretical bounds
are quite pessimistic for the example, as expected, since they are
equal to one for all the values of k, n, and $\epsilon$ used. The deleted estimate
performs well enough in this example to guarantee that the estimate will
be within .05 of the true value of $L_n$ approximately 93% of the time when
the data contains 400 observations. The data in Table 1 indicates that the
average value of $L_n$ is close to its limiting value for a much smaller number
of observations, indicating that the additional data necessary to insure
the accuracy of $\hat L_n$ may not yield significant improvement in $L_n$.

IV.  3 Example  2:  Discrimination  with  Gaussian  Densities 

In this example $\theta$ takes values 0 and 1 with equal probability, and
$f(x/\theta)$ is gaussian with mean $\theta$ and variance 1. $R^*$ for this example is
.3096. Table 2 gives the average value of $L_n$ obtained for k = 1, 3, 5, 7,
and 9 nearest neighbors, and n = 20, 50, 100, 200, and 400. Once again,
the average of $L_n$ seems to be already quite close to its limit when n = 20.
For this example, the bounds on R(k) are given by (4.3):

.3096 $\le$ R(1) $\le$ .5      (4.3a)
.3096 $\le$ R(3) $\le$ .402    (4.3b)
.3096 $\le$ R(5) $\le$ .361    (4.3c)
.3096 $\le$ R(7) $\le$ .344    (4.3d)
.3096 $\le$ R(9) $\le$ .334    (4.3e)

The data in Table 2 approaches these bounds closely.


Table 2. Experimentally Obtained Average Values of $L_n$ for Example 2

        n=20    n=50    n=100   n=200   n=400
k=1     .4036   .4001   .3987   .3980   .3981
k=3     .3820   .3720   .3692   .3683   .3680
k=5     .3722   .3586   .3541   .3536   .3531
k=7     .3683   .3505   .3456   .3444   .3445
k=9     .3730   .3451   .3403   .3386   .3387

In Figs. 7-11 we have shown, for the same values of k, graphs
of $P\{|L_n - \hat L_n| > \epsilon\}$ versus n for $\epsilon$ = .05, .1, and .2, where $\hat L_n$ is the
deleted estimate. Figs. 13-17 show the same data for the holdout
estimate. The behavior of the deleted estimate in this example is quite
similar to Example 1, although the values for $\epsilon$ have been adjusted
upward slightly, indicating the slightly more complex nature of this
problem. In this example, increasing k shows some slight deterioration
in the performance of the deleted estimate when n is small, but for
larger values of n, increasing k actually seems to improve performance,
if anything. A comparison of the curves for the holdout estimate with
those for the deleted estimate reveals that much larger values of n are
needed to achieve the same performance levels, as expected.

Figs. 12 and 18 show the average squared error of the deleted
and holdout estimates, respectively, for the same values of n as before.
The performance is so uniform in k that only one curve is shown in
Figs. 12 and 18. This curve represents all k = 1, 3, 5, 7, and 9 nearest
neighbor rules. The rate of decrease for the deleted estimate is slightly
faster than 1/n as predicted by Rogers and Wagner. Theorem 6 in Chapter
III indicates that the rate of decrease for the squared error of the holdout
estimate should be better than $1/\sqrt{n}$, which is the case in this example.

IV. 4 Example 3: Estimation

In this example, $\theta$ is uniformly distributed on the interval [0,1],
while $f(x/\theta)$ has a gaussian density with mean $\theta$ and variance 1. The
example can be considered a model of a system in which the observation
$X$ is simply $\theta$ corrupted by zero mean gaussian noise. $R^*$ was computed
numerically for the squared-error loss function, and is approximately
equal to .076913. The limiting values of $L_n$ for k = 1, 3, 5, 7, and 9 are
given by (4.4) in accordance with Theorem 3:

R(1) = .1538    (4.4a)
R(3) = .1025    (4.4b)
R(5) = .0923    (4.4c)
R(7) = .0879    (4.4d)
R(9) = .0855    (4.4e)

Table 3 shows the average values of $L_n$ for k = 1, 3, 5, 7 and 9, and
n = 20, 50, 100, 200, 400. It is again apparent that $EL_n$ approaches its
limit fairly closely even for n = 20. The values for n = 400 are comparable
to the limits predicted by (4.4). Note that in this example the average
risk for k = 1 is almost cut in half by the use of 9 neighbors for all the
values of n shown.
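The numerical value of $R^*$ quoted for this example can be checked independently. The sketch below is not the computation used in the original study, which is not described; it assumes only the model stated above ($\theta$ uniform on [0,1], $X$ given $\theta$ gaussian with mean $\theta$ and variance 1) and uses the fact that, for squared-error loss, the Bayes risk is the expected posterior variance of $\theta$. Integrals are approximated by the trapezoid rule on grids whose sizes are arbitrary choices.

```python
import math

def bayes_risk_uniform_gaussian(n_theta=401, x_lo=-8.0, x_hi=9.0, n_x=681):
    """Approximate R* = E[Var(theta | X)] for theta ~ U[0,1], X | theta ~ N(theta, 1)."""
    h_t = 1.0 / (n_theta - 1)
    thetas = [j * h_t for j in range(n_theta)]
    w_t = [h_t] * n_theta                      # trapezoid weights on [0, 1]
    w_t[0] = w_t[-1] = h_t / 2
    h_x = (x_hi - x_lo) / (n_x - 1)
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    risk = 0.0
    for i in range(n_x):
        x = x_lo + i * h_x
        # joint density f(x / theta) * 1 on the theta grid
        dens = [norm * math.exp(-0.5 * (x - t) ** 2) for t in thetas]
        m0 = sum(w * d for w, d in zip(w_t, dens))                    # marginal density of X
        m1 = sum(w * d * t for w, d, t in zip(w_t, dens, thetas))
        m2 = sum(w * d * t * t for w, d, t in zip(w_t, dens, thetas))
        w_x = h_x / 2 if i in (0, n_x - 1) else h_x
        risk += w_x * (m2 - m1 * m1 / m0)      # Var(theta | x) times marginal density
    return risk

if __name__ == "__main__":
    r_star = bayes_risk_uniform_gaussian()
    print(r_star, [(1 + 1 / k) * r_star for k in (1, 3, 5, 7, 9)])
```

Multiplying the result by $(1 + 1/k)$ gives limiting values that can be compared with (4.4).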
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Table  3.  Experimentally  Obtained  Average  Values  of  L for  Example  3 

        n=20    n=50    n=100   n=200   n=400
k=1     .1550   .1541   .1539   .1536   .1535
k=3     .1031   .1025   .1023   .1024   .1024
k=5     .0932   .0923   .0921   .0921   .0922
k=7     .0892   .0880   .0877   .0877   .0878
k=9     .0873   .0857   .0853   .0853   .0853

Figs. 19-23 show $P\{|L_n - \hat L_n| > \epsilon\}$ for $\epsilon$ = .01, .025, and .05, where
$\hat L_n$ is the deleted estimate. Figs. 25-29 show the same data for the
holdout estimate. The curves are similar to those of the preceding
examples, and the same comments apply, with the following exception.
The performance of the estimates improves markedly as k increases for
all values of n and $\epsilon$, for both the deleted and holdout estimates. This
is in contrast to the deterioration in performance predicted by Theorems
5 and 6 for increasing k. A partial explanation for this appears to be
in the uniform distribution chosen for $\theta$. For example, if we had chosen
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then  the  performance  of  the  estimates  would  be  expected  to  approach 
that  of  Example  2 as  a becomes  small. 

Figs. 24 and 30 show the average squared error of the deleted
and holdout estimates respectively, for the same values of n and $\epsilon$.
The curves are again similar to those of Example 2. Once again, the
rate of decrease is approximately as predicted by Theorems 5 and 6,
at 1/n for the deleted estimate and $1/\sqrt{n}$ for the holdout estimate.

IV.  5 Conclusions 

The conclusions to be drawn from these examples must be viewed
in light of the fact that they are based on extremely limited evidence.
However, the examples used are fairly general, and the behavior
exhibited should carry over to many similar problems. These problems are
characterized not so much by the distribution of $\theta$, which is discrete in
Examples 1 and 2 and continuous in Example 3, but by the smooth
distribution of x given $\theta$.

The most obvious suggestion made by these examples is that
fairly large values of k can be used without risking a reduction in
performance either of the rule or of the deleted or holdout estimates of the
risk. In fact, for the estimation problem in Example 3, larger values of
k not only improved $L_n$, but they also improved the deleted and holdout
estimates. For Example 3, both $L_n$ and the estimate of $L_n$ were still
improving with k = 9 in the case where n = 20.

We also note that in every case, the average value of $L_n$ was
quite near its limiting value long before the deleted estimate was
achieving acceptable performance levels. Although this does not
indicate that $L_n$ itself is converging that rapidly, it does seem to offer some
evidence that $L_n$ may be converging somewhat more rapidly than the
error estimates. It may be possible to take advantage of this to improve
the performance of the holdout estimate by holding out a larger portion
of the data. In the examples above, $\ell = \sqrt{n}$ observations were used
to compute the holdout estimate. In cases where $L_n$ converges quickly,
$L_{n-\ell}$ will be close to $L_n$ for larger values of $\ell$. Since the holdout
estimate is essentially an unbiased estimate of $L_{n-\ell}$, and the variance
of the estimate declines with $\ell$, performance could be improved by
increasing $\ell$.

Finally, we note that for each value of k, n, and $\epsilon$ used in all
the above examples, the bounds given by Theorems 5 and 6 are equal
to one, and so are extremely pessimistic, as expected. However,
the rate of decrease of the average squared error is consistent with
the rate of decrease of the theoretical bounds.




V.  SUMMARY 


An  attempt  has  been  made  to  identify  the  questions  which  are 
of  genuine  importance  to  a statistician  who  is  interested  in  evaluating 
nonparametric  estimation  rules.  It  was  pointed  out  that  asymptotic 
performance  is  an  important  criterion  in  many  cases , even  though  data 
sets  are  always  finite  in  practical  situations.  Considering  the  reasons 
behind  interest  in  asymptotic  results,  it  appears  that  the  important 
question  in  this  area  concerns  what  statements  can  be  made  about  the 
performance  of  the  rule  as  the  size  of  a single  data  set  is  increased. 

The  importance  of  estimating  finite  sample  performance  was  discussed 
and  the  value  of  a distribution-free  bound  on  the  error  of  such  estimates 
was  indicated. 

The  asymptotic  performance  of  various  nearest  neighbor  rules 
and  loss  functions  was  analyzed,  and  the  results  concerning 
the  convergence  of  the  conditional  risk  are  quite  strong  considering  the 
simple  nature  of  the  rules  and  the  minimal  assumptions  made  concerning 
the  problem  structure.  In  addition,  distribution-free  bounds  on  the  error 
of  two  different  estimates  of  finite  sample  performance  are  derived  for 
a class of estimation rules which includes k-nearest neighbor rules.

The  distribution-free  nature  of  these  bounds  indicates  that  they  are 
undoubtedly  not  tight  for  many  specific  choices  of  the  distribution 
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